Failsafe RetryPolicy instrumentation added #15255

onurkybsi · 2025-11-10T06:01:57Z

Library instrumentation added for Failsafe's RetryPolicy.

laurit · 2025-11-14T16:33:39Z

.../library/src/main/java/io/opentelemetry/instrumentation/failsafe/v3_0/FailsafeTelemetry.java

+            .build();
+    LongHistogram attemptsHistogram =
+        meter
+            .histogramBuilder("failsafe.retry_policy.attempts")


I'm not sure using a histogram for this is justified. @trask, could you provide guidance on this?

don't know why I didn't thread this: #15255 (comment)

trask · 2025-11-19T04:17:11Z

.../library/src/main/java/io/opentelemetry/instrumentation/failsafe/v3_0/FailsafeTelemetry.java

+            .setDescription("Histogram of number of attempts for each execution.")
+            .ofLongs()
+            .setExplicitBucketBoundariesAdvice(
+                LongStream.range(1, userConfig.getMaxAttempts() + 1)


@onurkybsi what's typical userConfig.getMaxAttempts()?

could you come up with a smallish static set, e.g. 1, 2, 5, 10, 20, 50?

also worth reading open-telemetry/semantic-conventions#316 (comment)

Hey @trask, userConfig.getMaxAttempts() returns the user-configured maximum number of attempts allowed for a retry policy execution. So if this value is 3, the possible values are 1 (execution succeeded without a retry), 2 (first retry), and 3 (last attempt as configured). The current implementation relies on this: it creates one bucket boundary for each value from 1 up to the maximum number of attempts.

I may not have taken very large values into account. Do you think we should? If so, I can refactor this part to build a list that distributes the range (1 to maxAttempts) evenly, capped at a maximum number of buckets such as 10. Maybe something like this:

private static List<Long> buildBoundaries(int maxNumOfBuckets, long maxNumOfAttempts) {
  List<Long> boundaries = new ArrayList<>(maxNumOfBuckets);
  boundaries.add(1L);
  double step = (double) (maxNumOfAttempts - 1) / (maxNumOfBuckets - 1);
  for (int i = 1; i < maxNumOfBuckets; i++) {
    long boundary = Math.min(Math.round(1 + step * i), maxNumOfAttempts);
    boundaries.add(boundary);
  }
  return boundaries.stream()
      .distinct()
      .sorted()
      .toList();
}

What do you think?
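To make the proposal concrete, here is a minimal standalone sketch that runs the helper for two values of the maximum attempts. The class name `BoundariesDemo` and the inputs 3 and 100 are assumptions chosen for illustration; the logic is copied from the comment above, with `Collectors.toList()` swapped in for `Stream.toList()` only so the sketch also compiles on JDKs older than 16.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative wrapper (hypothetical name) around the helper proposed in this thread. */
final class BoundariesDemo {

  // Same logic as the proposal; only the terminal collector differs.
  static List<Long> buildBoundaries(int maxNumOfBuckets, long maxNumOfAttempts) {
    List<Long> boundaries = new ArrayList<>(maxNumOfBuckets);
    boundaries.add(1L);
    // Spread the remaining boundaries evenly between 1 and maxNumOfAttempts.
    double step = (double) (maxNumOfAttempts - 1) / (maxNumOfBuckets - 1);
    for (int i = 1; i < maxNumOfBuckets; i++) {
      long boundary = Math.min(Math.round(1 + step * i), maxNumOfAttempts);
      boundaries.add(boundary);
    }
    // Rounding can produce duplicates for small ranges, so drop them.
    return boundaries.stream().distinct().sorted().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(buildBoundaries(10, 3));   // [1, 2, 3]
    System.out.println(buildBoundaries(10, 100)); // [1, 12, 23, 34, 45, 56, 67, 78, 89, 100]
  }
}
```

With a maximum of 3 the helper collapses to one boundary per attempt, while a maximum of 100 yields the full ten boundaries, which is the bucket count the cost concern applies to.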

buckets are costly, so I'd try to keep the number small if possible, e.g. with gc duration metrics, we went with just 5 buckets: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/runtime/jvm-metrics.md#metric-jvmgcduration

do you have any idea what are typical values for userConfig.getMaxAttempts()?

The default is 3 in Failsafe, and the same in resilience4j. I think a value above 5 wouldn't make sense in most cases, so maybe just [1, 2, 3, 5]. What do you say?

Sounds good
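As a rough illustration of what the agreed boundaries [1, 2, 3, 5] mean for recorded values, the sketch below buckets a few attempt counts by hand. It does not use the OpenTelemetry API; the class `AttemptBuckets` and its methods are hypothetical, and it assumes that explicit bucket boundaries act as inclusive upper bounds, with one extra overflow bucket above the last boundary.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: mimics how explicit bucket boundaries partition recorded values. */
final class AttemptBuckets {

  // Boundaries agreed on in this thread.
  static final List<Long> BOUNDARIES = Arrays.asList(1L, 2L, 3L, 5L);

  /** Returns the bucket index for a value; each boundary is treated as an inclusive upper bound. */
  static int bucketIndex(long attempts) {
    for (int i = 0; i < BOUNDARIES.size(); i++) {
      if (attempts <= BOUNDARIES.get(i)) {
        return i;
      }
    }
    // Overflow bucket: everything above the last boundary.
    return BOUNDARIES.size();
  }

  /** Counts how many of the recorded values land in each bucket. */
  static long[] histogram(long... recorded) {
    long[] counts = new long[BOUNDARIES.size() + 1];
    for (long value : recorded) {
      counts[bucketIndex(value)]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    // Attempt counts 4 and 5 share a bucket; 8 falls into the overflow bucket.
    System.out.println(Arrays.toString(histogram(1, 1, 2, 3, 4, 5, 8))); // [2, 1, 1, 2, 1]
  }
}
```

Four boundaries give five buckets in total, so executions with the default maximum of 3 attempts are still resolved one attempt per bucket, and larger configurations only lose detail above 3.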

jaydeluca · 2025-11-26T21:58:12Z

.../library/src/main/java/io/opentelemetry/instrumentation/failsafe/v3_0/FailsafeTelemetry.java

+                "Count of execution events processed by the retry policy. "
+                    + "Each event represents one complete execution flow (initial attempt + any retries). "
+                    + "This metric does not count individual retry attempts - it counts each time the policy is invoked.")


just an idea, but perhaps we could minimize the length of this description a bit by encoding some of this information in the unit?

Suggested change

-                "Count of execution events processed by the retry policy. "
-                    + "Each event represents one complete execution flow (initial attempt + any retries). "
-                    + "This metric does not count individual retry attempts - it counts each time the policy is invoked.")
+                "Count of execution events processed by the retry policy.")
+            .setUnit("{policy_invocation}")

jaydeluca · 2025-11-26T21:59:31Z

.../library/src/main/java/io/opentelemetry/instrumentation/failsafe/v3_0/FailsafeTelemetry.java

+    LongHistogram attemptsHistogram =
+        meter
+            .histogramBuilder("failsafe.retry_policy.attempts")
+            .setDescription("Histogram of number of attempts for each execution.")


Suggested change

-            .setDescription("Histogram of number of attempts for each execution.")
+            .setDescription("Number of attempts for each execution.")

jaydeluca · 2025-11-27T01:34:27Z

...rary/src/test/java/io/opentelemetry/instrumentation/failsafe/v3_0/FailsafeTelemetryTest.java

+    }
+
+    // then
+    testing.waitAndAssertMetrics("io.opentelemetry.failsafe-3.0");


is there a reason not to use the other style of assertions here?

testing.waitAndAssertMetrics(
    "io.opentelemetry.failsafe-3.0",
    metricAssert ->
        metricAssert
            .hasName("failsafe.retry_policy.execution.count")
            ... etc

No special reason. I started with that style, but at some point it felt like it was overcomplicating what I was trying to do, so I switched to this style, which I found easier to read.

The usual approach, the one @jaydeluca pointed out, retries until the assertion succeeds. For example, if the method is called before the metric data points are available (data is exported from a background thread), it waits a bit and retries the assertions.
The way you wrote it, testing.waitAndAssertMetrics("io.opentelemetry.failsafe-3.0"); only waits for any metric data to be available. The code that follows assumes there are exactly 2 metrics. I find it somewhat hard to reason about whether this is guaranteed. It probably is, because no other metrics should be generated, but it's hard to tell whether something else could affect this. That is why I believe it is best to write the assertions the same way they are written elsewhere.
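The difference between "wait for any data" and "retry until the assertion passes" can be sketched in plain Java. This is not the test framework's implementation: `AwaitAssertion` and `awaitUntilAsserted` are hypothetical names, and the background thread merely stands in for an exporter that publishes metrics with a delay.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Illustrative polling helper (hypothetical, not the real testing API). */
final class AwaitAssertion {

  /** Re-runs the assertion until it passes or the timeout elapses, then rethrows the last failure. */
  static void awaitUntilAsserted(Duration timeout, Runnable assertion) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      try {
        assertion.run();
        return; // the assertion passed
      } catch (AssertionError e) {
        if (System.nanoTime() - deadline >= 0) {
          throw e; // out of time: surface the last failure
        }
        try {
          Thread.sleep(10); // wait a bit before retrying
        } catch (InterruptedException interrupted) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted while waiting", interrupted);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> exported = new CopyOnWriteArrayList<>();
    // Stand-in for a background exporter: the second metric arrives later than the first.
    Thread exporter = new Thread(() -> {
      exported.add("failsafe.retry_policy.execution.count");
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      exported.add("failsafe.retry_policy.attempts");
    });
    exporter.start();

    // Checking as soon as any data exists could miss the second metric;
    // retrying the assertion waits until the expected metric is present.
    awaitUntilAsserted(Duration.ofSeconds(5), () -> {
      if (!exported.contains("failsafe.retry_policy.attempts")) {
        throw new AssertionError("attempts metric not exported yet");
      }
    });
    exporter.join();
    System.out.println(exported);
  }
}
```

Because the assertion itself is retried, the test does not depend on how many metrics happen to be available at the moment the first data point shows up.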

onurkybsi force-pushed the retry-policy branch from 17f361b to ee4ab3d Compare November 11, 2025 05:33

onurkybsi marked this pull request as ready for review November 11, 2025 05:33

onurkybsi requested a review from a team as a code owner November 11, 2025 05:33

laurit reviewed Nov 14, 2025

Failsafe RetryPolicy instrumentation added

e0715f9

onurkybsi force-pushed the retry-policy branch from ee4ab3d to e0715f9 Compare November 18, 2025 05:45

trask reviewed Nov 19, 2025

onurkybsi added 2 commits November 24, 2025 06:45

Review comments addressed

d2e8c60

Minor fix

1024227

jaydeluca reviewed Nov 27, 2025

Review comments addressed

8456c05

4 participants
