Skip to content

MSQuic will stall in OOM conditions on Windows #5369

@fbrosseau

Description

@fbrosseau

Describe the bug

All calls to CxPlatEventQEnqueue/Ex must check for return value, otherwise MSQuic will stall.

At first glance, I can find these problematic places:

Q:\src\msquic\src\platform\datapath_raw_xdp_win.c(1951):CxPlatEventQEnqueue(Partition->EventQ, &Partition->ShutdownSqe);
Q:\src\msquic\src\platform\platform_worker.c(228):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->ShutdownSqe);
Q:\src\msquic\src\platform\platform_worker.c(504):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->UpdatePollSqe);
Q:\src\msquic\src\platform\platform_worker.c(515):CxPlatEventQEnqueue(&Worker->EventQ, &Worker->WakeSqe);

I am going to write this bug report all in terms of Windows, but I would assume that it applies to some of the other implementations as well (in the epoll or kqueue io_uring worlds).

CxPlatEventQEnqueue, aka PostQueuedCompletionStatus, can fail for two reasons:
1- The IOCP handle is bogus, or it was unexpectedly closed - this would be a program error.
2- OOM. The reason for this is that NtSetIoCompletion underneath PQCS must allocate an IO completion packet to add to the target IOCP *.

It would not be easy to manually repro this, but in complex programs that like to live on the edge of 100% memory (like a database), this is guaranteed to happen as a one-in-a-million bug.

Similarly, I can see that some usages check the return with CXPLAT_FRE_ASSERT & co. This is problematic in programs that expect to out-live OOM issues.

I am suggesting more robust implementations below, but I recognize that they may be big amounts of work. At minimum, MSQuic must handle the error returns.

alternatives

Alternatively, one thing that could be done, would be to NOT use PostQueuedCompletionStatus for things like Wake worker up, and instead use Win32 events and wait completion packets. If you are unfamiliar, wait completion packets were added to win7 (iirc?) and they allow posting an IO completion item to the IOCP of your choice when an event gets signaled. This innovation closed the gap between the event/wait world and the IOCP world. This would be failure-free, and I believe it would outperform the current solution of stock PostQueuedCompletionStatus. As a matter of fact, this is even closer in spirit to how you would implement wakeup in Linux, with an eventfd+epoll pair, so it might actually make the crossplatform code simpler.

*If we want to get technical, it is possible to leverage "reserve objects" through NtAllocateReserveObject and eliminate the possibility of OOM entirely by pre-creating the reusable IO packet. Not only does this remove the problem, but it is also an optimization. However, MSQuic would have to get OS folks to document it to MSDN it in order to be able to use it.

Affected OS

  • Windows
  • Linux
  • macOS
  • Other (specify below)

Additional OS information

All Windows versions

MsQuic version

main

Steps taken to reproduce bug

Be at memory exhaustion levels
CxPlatEventQEnqueue

Expected behavior

N/A

Actual outcome

Ignoring the errors means ignoring that the continuation will never happen. If the continuation never happens, the program will stall.

Additional details

No response

Metadata

Metadata

Assignees

Labels

Area: PlatformTriagedThis item has been triaged by an MsQuic owner

Type

Projects

Status

In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions