Added failure path in case of Event Enqueue fails #5617
base: main
Conversation
@Santhosha-bk please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement: This Contribution License Agreement ("Agreement") is agreed to by the party signing below ("You"),
    "CxPlatEventQEnqueue failed (Shutdown)");
// Queue can't run the completion, so do it inline to finish teardown.
CxPlatSocketContextUninitializeEventComplete(&SocketContext->ShutdownSqe.Cqe);
I'm not sure we can simply run inline - in at least some of these files, I expect the orderly completion is relied on for serialization / mutual exclusion. Instead, we need to provide a backup mechanism that ensures orderly completion runs even if the underlying OS queue fails.
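One hedged sketch of what such a backup mechanism could look like (the FALLBACK_* types and helpers below are hypothetical, not existing msquic APIs; only CxPlatEventQEnqueue is from the code under discussion): when the OS enqueue fails, the completion is parked on a per-worker list that the worker thread drains on its next loop pass, so the completion still runs on the worker thread and the existing serialization is preserved.

    #include <stddef.h>

    //
    // Hypothetical fallback queue; names are illustrative, not msquic APIs.
    //
    typedef struct FALLBACK_ENTRY {
        struct FALLBACK_ENTRY* Next;
        void (*Complete)(void* Context); // the completion that would have run via the event queue
        void* Context;
    } FALLBACK_ENTRY;

    typedef struct FALLBACK_QUEUE {
        FALLBACK_ENTRY* Head; // pushed on enqueue failure, drained by the worker thread
    } FALLBACK_QUEUE;

    //
    // Called on the thread whose CxPlatEventQEnqueue call failed. A real
    // implementation would need the worker's lock or an interlocked push;
    // synchronization is omitted to keep the sketch short.
    //
    void FallbackEnqueue(FALLBACK_QUEUE* Queue, FALLBACK_ENTRY* Entry)
    {
        Entry->Next = Queue->Head;
        Queue->Head = Entry;
    }

    //
    // Called by the worker thread on each loop pass, after the normal CQEs
    // are processed, so parked completions still run in worker-thread order.
    //
    void FallbackDrain(FALLBACK_QUEUE* Queue)
    {
        FALLBACK_ENTRY* Entry = Queue->Head;
        Queue->Head = NULL;
        while (Entry != NULL) {
            FALLBACK_ENTRY* Next = Entry->Next;
            Entry->Complete(Entry->Context);
            Entry = Next;
        }
    }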
@mtfriesen Rather than running it inline, as a backup mechanism we could retry the enqueue on failure with a short delay, starting at 1 ms and backing off, as shown below. Please share your thoughts on this approach.
uint32_t RetryCount = 0;
const uint32_t MaxRetries = 10; // example bound
while (!CxPlatEventQEnqueue(&Worker->EventQ, &Worker->ShutdownSqe)) {
    if (++RetryCount >= MaxRetries) {
        QuicTraceLogWarning(
            WorkerShutdownEnqueueRetry,
            "[wrkr][%p] Shutdown enqueue failed, retry %u/%u",
            Worker,
            RetryCount,
            MaxRetries);
        break;
    }
    // Exponential backoff: 1ms, 2ms, 4ms, 8ms, ...
    CxPlatSleep(1U << (RetryCount - 1));
}
I don't think that retrying with a delay is an acceptable option; it only pushes the problem back: what happens if it keeps failing, do we hang forever? Or are we back to a fre assert?
It might also be worth checking whether the issue reported in the bug can happen on Unix.
Typically, memory management won't let an allocation failure happen (a process will either get memory or be killed).
I was thinking we could have a loop of 5-10 retries, but this is not an acceptable solution.
MsQuic stalls only in CxPlatWorkerPoolDestroyWorker because the thread waits for the worker thread to rejoin after a failed CxPlatEventQEnqueue. In all other cases where CxPlatEventQEnqueue is used, this behavior does not occur.
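For reference, a simplified sketch of the sequence where the stall happens (assuming msquic's internal quic_platform.h for the CXPLAT_* types and helpers; the pared-down WORKER struct and function name are illustrative, not the actual CxPlatWorkerPoolDestroyWorker code):

    //
    // Illustrative only; WORKER stands in for the real worker structure.
    //
    typedef struct WORKER {
        CXPLAT_EVENTQ EventQ;
        CXPLAT_SQE ShutdownSqe;
        CXPLAT_THREAD Thread;
    } WORKER;

    void DestroyWorker(WORKER* Worker)
    {
        if (!CxPlatEventQEnqueue(&Worker->EventQ, &Worker->ShutdownSqe)) {
            //
            // No shutdown completion will ever be delivered, so the worker
            // thread keeps running and never exits its loop...
            //
        }
        CxPlatThreadWait(&Worker->Thread);   // ...and this wait never returns.
        CxPlatThreadDelete(&Worker->Thread);
    }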
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #5617    +/-   ##
=========================================
  Coverage        ?   85.28%
=========================================
  Files           ?       60
  Lines           ?    18663
  Branches        ?        0
=========================================
  Hits            ?    15917
  Misses          ?     2746
  Partials        ?        0

☔ View full report in Codecov by Sentry.
Description
CxPlatEventQEnqueue, aka PostQueuedCompletionStatus, can fail because of OOM. The reason is that NtSetIoCompletion, underneath PQCS, must allocate an IO completion packet to add to the target IOCP.
The same applies to epoll, kqueue, and io_uring.
Added a failure path to finish the work inline in case CxPlatEventQEnqueue fails.
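A minimal sketch of the added failure path for the datapath shutdown case (the trace message and the inline completion call are taken from the diff quoted above; the EventQ expression and the trace event name are assumptions, and the fragment would sit wherever the shutdown enqueue currently happens):

    if (!CxPlatEventQEnqueue(EventQ, &SocketContext->ShutdownSqe)) {
        QuicTraceLogWarning(
            DatapathShutdownEnqueueFailed,            // assumed event name
            "CxPlatEventQEnqueue failed (Shutdown)");
        //
        // Queue can't run the completion, so do it inline to finish teardown.
        //
        CxPlatSocketContextUninitializeEventComplete(&SocketContext->ShutdownSqe.Cqe);
    }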
#5369
Describe the purpose of and changes within this Pull Request.
Testing
Do any existing tests cover this change? Are new tests needed?
Documentation
Is there any documentation impact for this change?