-
Notifications
You must be signed in to change notification settings - Fork 621
Description
Describe the bug
All calls to CxPlatEventQEnqueue/Ex must check for return value, otherwise MSQuic will stall.
At first glance, I can find these problematic places:
Q:\src\msquic\src\platform\datapath_raw_xdp_win.c(1951):CxPlatEventQEnqueue(Partition->EventQ, &Partition->ShutdownSqe);
Q:\src\msquic\src\platform\platform_worker.c(228):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->ShutdownSqe);
Q:\src\msquic\src\platform\platform_worker.c(504):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->UpdatePollSqe);
Q:\src\msquic\src\platform\platform_worker.c(515):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->WakeSqe);
I am going to write this bug report all in terms of Windows, but I would assume that it applies to some of the other implementations as well (in the epoll or kqueue io_uring worlds).
CxPlatEventQEnqueue, aka PostQueuedCompletionStatus, can fail for two reasons:
1- The IOCP handle is bogus, or it was unexpectedly closed - this would be a program error.
2- OOM. The reason for this is that NtSetIoCompletion underneath PQCS must allocate an IO completion packet to add to the target IOCP *.
It would not be easy to manually repro this, but in complex programs that like to live on the edge of 100% memory (like a database), this is guaranteed to happen as a one-in-a-million bug.
Similarly, I can see that some usages check the return with CXPLAT_FRE_ASSERT & co. This is problematic in programs that expect to out-live OOM issues.
I am suggesting more robust implementations below, but I recognize that they may be big amounts of work. At minimum, MSQuic must handle the error returns.
alternatives
Alternatively, one thing that could be done, would be to NOT use PostQueuedCompletionStatus for things like Wake worker up, and instead use Win32 events and wait completion packets. If you are unfamiliar, wait completion packets were added to win7 (iirc?) and they allow posting an IO completion item to the IOCP of your choice when an event gets signaled. This innovation closed the gap between the event/wait world and the IOCP world. This would be failure-free, and I believe it would outperform the current solution of stock PostQueuedCompletionStatus. As a matter of fact, this is even closer in spirit to how you would implement wakeup in Linux, with an eventfd+epoll pair, so it might actually make the crossplatform code simpler.
*If we want to get technical, it is possible to leverage "reserve objects" through NtAllocateReserveObject and eliminate the possibility of OOM entirely by pre-creating the reusable IO packet. Not only does this remove the problem, but it is also an optimization. However, MSQuic would have to get OS folks to document it to MSDN it in order to be able to use it.
Affected OS
- Windows
- Linux
- macOS
- Other (specify below)
Additional OS information
All Windows versions
MsQuic version
main
Steps taken to reproduce bug
Be at memory exhaustion levels
CxPlatEventQEnqueue
Expected behavior
N/A
Actual outcome
Ignoring the errors means ignoring that the continuation will never happen. If the continuation never happens, the program will stall.
Additional details
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status