Failsafe RetryPolicy instrumentation added #15255
base: main
Conversation
Force-pushed from 17f361b to ee4ab3d (compare)
        .build();
LongHistogram attemptsHistogram =
    meter
        .histogramBuilder("failsafe.retry_policy.attempts")
I'm not sure using a histogram for this is justified. @trask, could you provide guidance on this?
don't know why I didn't thread this: #15255 (comment)
Force-pushed from ee4ab3d to e0715f9 (compare)
        .setDescription("Histogram of number of attempts for each execution.")
        .ofLongs()
        .setExplicitBucketBoundariesAdvice(
            LongStream.range(1, userConfig.getMaxAttempts() + 1)
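For context, the advice in the diff above produces one bucket boundary per possible attempt count. A minimal, self-contained sketch of what that range expands to (the value 3 is hypothetical, matching Failsafe's default max attempts):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class BoundaryAdviceDemo {
  // Mirrors LongStream.range(1, maxAttempts + 1) from the diff context:
  // one boundary for each attempt count from 1 to maxAttempts inclusive.
  public static List<Long> boundariesFor(int maxAttempts) {
    return LongStream.range(1, maxAttempts + 1).boxed().collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(boundariesFor(3)); // prints [1, 2, 3]
  }
}
```

So a user-configured max of 3 yields exactly three buckets, but an unusually large max would yield one bucket per attempt, which is what the discussion below is about.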
@onurkybsi what's typical userConfig.getMaxAttempts()?
could you come up with a smallish static set, e.g. 1, 2, 5, 10, 20, 50?
also worth reading open-telemetry/semantic-conventions#316 (comment)
Hey @trask, userConfig.getMaxAttempts() returns the user-configured maximum number of attempts allowed for the retry policy execution. So if this value is 3, the possibilities would be [1 (execution succeeded without retry), 2 (first retry), 3 (last attempt as configured)]. The implementation uses this fact, i.e., one boundary per value between 1 and the maximum attempt.
Maybe I didn't take enormous numbers into account. Do you think we should? If so, I can refactor this part to build a list that distributes the range (1 to maxAttempt) evenly, capped at a maximum number of buckets like 10. Maybe something like this:
private static List<Long> buildBoundaries(int maxNumOfBuckets, long maxNumOfAttempts) {
  List<Long> boundaries = new ArrayList<>(maxNumOfBuckets);
  boundaries.add(1L);
  double step = (double) (maxNumOfAttempts - 1) / (maxNumOfBuckets - 1);
  for (int i = 1; i < maxNumOfBuckets; i++) {
    long boundary = Math.min(Math.round(1 + step * i), maxNumOfAttempts);
    boundaries.add(boundary);
  }
  return boundaries.stream()
      .distinct()
      .sorted()
      .toList();
}

What do you think?
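A self-contained check of the sketch above (the inputs 100 and 3 are hypothetical examples, not values from the PR): with a large max the boundaries spread evenly across up to 10 buckets, and with a small max the distinct/sorted step collapses them back to one boundary per attempt count.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundarySketch {
  // Same logic as the proposed buildBoundaries sketch from the comment above.
  public static List<Long> buildBoundaries(int maxNumOfBuckets, long maxNumOfAttempts) {
    List<Long> boundaries = new ArrayList<>(maxNumOfBuckets);
    boundaries.add(1L);
    // Evenly distribute the remaining boundaries over (1, maxNumOfAttempts].
    double step = (double) (maxNumOfAttempts - 1) / (maxNumOfBuckets - 1);
    for (int i = 1; i < maxNumOfBuckets; i++) {
      long boundary = Math.min(Math.round(1 + step * i), maxNumOfAttempts);
      boundaries.add(boundary);
    }
    return boundaries.stream().distinct().sorted().toList();
  }

  public static void main(String[] args) {
    System.out.println(buildBoundaries(10, 100)); // 10 evenly spaced boundaries
    System.out.println(buildBoundaries(10, 3));   // collapses to [1, 2, 3]
  }
}
```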
buckets are costly, so I'd try to keep the number small if possible, e.g. with gc duration metrics, we went with just 5 buckets: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/runtime/jvm-metrics.md#metric-jvmgcduration
do you have any idea what are typical values for userConfig.getMaxAttempts()?
It's 3 by default in Failsafe, and the same for resilience4j. I don't think a value greater than 5 would make sense in most cases, so maybe just [ 1, 2, 3, 5 ]. What do you say?
Sounds good
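As an aside, a quick sketch of how the agreed static boundaries [1, 2, 3, 5] would partition attempt counts. This assumes the usual explicit-bucket convention (upper bound inclusive); it is an illustration, not Failsafe or OpenTelemetry SDK code:

```java
import java.util.List;

public class BucketDemo {
  // The static boundary set agreed on in the discussion above.
  static final List<Long> BOUNDARIES = List.of(1L, 2L, 3L, 5L);

  // Returns the index of the bucket an attempt count falls into:
  // (-inf, 1], (1, 2], (2, 3], (3, 5], (5, +inf)
  public static int bucketIndex(long attempts) {
    for (int i = 0; i < BOUNDARIES.size(); i++) {
      if (attempts <= BOUNDARIES.get(i)) {
        return i;
      }
    }
    return BOUNDARIES.size(); // overflow bucket for > 5 attempts
  }

  public static void main(String[] args) {
    for (long attempts = 1; attempts <= 6; attempts++) {
      System.out.println(attempts + " attempt(s) -> bucket " + bucketIndex(attempts));
    }
  }
}
```

With only 5 buckets total, counts of 4 and 5 share a bucket and anything above 5 lands in the overflow bucket, which keeps cardinality low while still covering the common defaults.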
"Count of execution events processed by the retry policy. "
    + "Each event represents one complete execution flow (initial attempt + any retries). "
    + "This metric does not count individual retry attempts - it counts each time the policy is invoked.")
just an idea, but perhaps we could minimize the length of this description a bit by encoding some of this information in the unit?
Suggested change:
- "Count of execution events processed by the retry policy. "
-     + "Each event represents one complete execution flow (initial attempt + any retries). "
-     + "This metric does not count individual retry attempts - it counts each time the policy is invoked.")
+ "Count of execution events processed by the retry policy.")
+     .setUnit("{policy_invocation}")
LongHistogram attemptsHistogram =
    meter
        .histogramBuilder("failsafe.retry_policy.attempts")
        .setDescription("Histogram of number of attempts for each execution.")
Suggested change:
- .setDescription("Histogram of number of attempts for each execution.")
+ .setDescription("Number of attempts for each execution.")
}

// then
testing.waitAndAssertMetrics("io.opentelemetry.failsafe-3.0");
is there a reason not to use the other style of assertions here?
testing.waitAndAssertMetrics(
"io.opentelemetry.failsafe-3.0",
metricAssert ->
metricAssert
.hasName("failsafe.retry_policy.execution.count")
... etc
Nothing special; I was using the same style, but at some point I felt it was overcomplicating what I was trying to do, so I switched to this style, which I found easier to read.
The usual way we do this, which @jaydeluca pointed out, waits until the assertion succeeds. For example, when the method is called and the metric data points aren't available yet (data is exported from a background thread), it will retry the assertions after waiting a bit.
The way you wrote it, testing.waitAndAssertMetrics("io.opentelemetry.failsafe-3.0"); waits for any metric data to be available, and the following code then assumes that you have exactly 2 metrics. I find it somewhat hard to reason about whether this is guaranteed. Probably it is, because no other metrics should be generated, but it is hard to tell whether something else could affect this. That is why I believe it is best to write the assertions the same way they are used elsewhere.
Library instrumentation added for Failsafe's RetryPolicy.