Defer Channel Cleanup after a channel is closed to avoid kv-sql stress #10390
Conversation
Force-pushed from 41fdc4b to 7bf68fa
/gemini review
Code Review
This pull request introduces a deferred cleanup mechanism for closed channels to alleviate stress on SQL-based database backends. Instead of performing expensive deletions synchronously during CloseChannel, channels are now marked as ChanStatusPendingCleanup and their cleanup metadata is stored. The actual deletion of channel data is deferred to node startup via a new CleanupPendingCloses function. This is a well-thought-out solution to the database contention problem. The changes are well-tested with a new unit test covering the deferred cleanup lifecycle. I've found a potential race condition when creating a top-level bucket and a minor inefficiency in the cleanup process, which I've commented on. Overall, this is a great improvement.
channeldb/channel.go
Outdated
    cleanupBucket, err := tx.CreateTopLevelBucket(
        pendingCleanupBucket,
    )
    if err != nil {
        return err
The call to tx.CreateTopLevelBucket will return an error if the bucket already exists. If multiple channels are closed concurrently, this could lead to failures for all but the first one because they will all try to create the same top-level bucket. You should handle the kvdb.ErrBucketExists error gracefully.
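One way to handle this gracefully is sketched below. This is only an illustration, assuming kvdb exposes the ErrBucketExists sentinel mentioned above and that tx.ReadWriteBucket can open an existing top-level bucket; it is not the PR's actual code.

```go
// Sketch only: treat "bucket already exists" as success and reuse
// the existing bucket instead of failing the whole transaction.
cleanupBucket, err := tx.CreateTopLevelBucket(pendingCleanupBucket)
switch {
case err == nil:
	// Freshly created, nothing else to do.

case errors.Is(err, kvdb.ErrBucketExists):
	// A concurrent close already created the bucket; open the
	// existing one instead.
	cleanupBucket = tx.ReadWriteBucket(pendingCleanupBucket)

default:
	return err
}
```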
channeldb/db.go
Outdated
    // First, count how many channels need cleanup.
    var numChannels int
    err := cleanupBucket.ForEach(func(_, _ []byte) error {
        numChannels++
        return nil
    })
    if err != nil {
        return err
    }

    if numChannels == 0 {
        return nil
    }

    log.Infof("Starting cleanup of %d closed channel(s) - this "+
        "may take a while", numChannels)
This function iterates over the cleanupBucket twice: once to count the number of channels for logging, and a second time to perform the cleanup. This could be inefficient during startup if there are many channels to clean up. To optimize, consider performing a single pass. For example, you could collect all keys and values into slices first, then log the count and process them. This would trade a small amount of memory for a potentially significant improvement in startup time.
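A minimal single-pass sketch of that suggestion follows; the pendingEntry type and the processEntry helper are illustrative names introduced here, not code from the PR.

```go
// Sketch: collect the entries in one pass, log the count, then
// process them from memory. Trades a little memory for one less
// full iteration of the bucket.
type pendingEntry struct {
	key, value []byte
}

var pending []pendingEntry
err := cleanupBucket.ForEach(func(k, v []byte) error {
	// Copy k and v so they remain valid outside the iteration.
	pending = append(pending, pendingEntry{
		key:   append([]byte(nil), k...),
		value: append([]byte(nil), v...),
	})
	return nil
})
if err != nil {
	return err
}

if len(pending) == 0 {
	return nil
}

log.Infof("Starting cleanup of %d closed channel(s) - this "+
	"may take a while", len(pending))

for _, entry := range pending {
	// processEntry stands in for the existing per-channel cleanup
	// logic (wipe forwarding packages, delete revocation logs,
	// remove the channel bucket, drop the entry).
	if err := processEntry(entry.key, entry.value); err != nil {
		return err
	}
}
```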
Force-pushed from 4927490 to 791ad11
Add a DeferrableBackend interface that backends can optionally implement to indicate they prefer deferring heavy operations (like bulk deletes of revocation logs) to startup rather than executing them inline. This is needed for postgres backends, where concurrent large transactions can cause lock contention and timeouts during normal operation.

The implementation:
- Adds the DeferrableBackend interface with ShouldDeferHeavyOperations()
- Adds a helper function to check whether a backend implements the interface
- Implements the interface on sqlbase.db, returning true only for postgres (identified by the "pgx" driver) and false for sqlite

Also runs go mod tidy.
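A rough sketch of what the interface and helper could look like is shown below. The names DeferrableBackend and ShouldDeferHeavyOperations come from the commit message; everything else (the Backend parameter type, the package placement) is an assumption for illustration.

```go
// DeferrableBackend is an optional interface a kvdb backend can
// implement to signal that heavy operations (e.g. bulk deletes of
// revocation logs) should be deferred to startup.
type DeferrableBackend interface {
	// ShouldDeferHeavyOperations reports whether heavy operations
	// should be deferred rather than executed inline.
	ShouldDeferHeavyOperations() bool
}

// ShouldDeferHeavyOperations returns true if the backend opts in to
// deferring heavy operations.
func ShouldDeferHeavyOperations(db Backend) bool {
	d, ok := db.(DeferrableBackend)
	return ok && d.ShouldDeferHeavyOperations()
}
```

Per the commit message, the sqlbase implementation of this method returns true only when the backend is postgres (identified by the "pgx" driver) and false for sqlite.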
For postgres backends, defer heavy cleanup operations (deleting revocation logs, forwarding packages) to startup rather than performing them inline during CloseChannel. This avoids lock contention and transaction timeouts that can occur when multiple channels are closed concurrently.

Changes:
- Add pendingCleanupBucket to store channels awaiting cleanup
- Add a PendingCleanupInfo struct with Encode/Decode methods
- Modify CloseChannel to check kvdb.ShouldDeferHeavyOperations():
  - For postgres: store cleanup info and defer deletion to startup
  - For bbolt/sqlite: perform immediate cleanup as before
- Add storePendingCleanup() and performImmediateCleanup() helpers
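The cleanup record might look something like the following sketch. The real field layout and serialization in the PR are not shown here; the fields are illustrative, and WriteElements/ReadElements are channeldb's existing codec helpers.

```go
// PendingCleanupInfo is the metadata stored under pendingCleanupBucket
// for a channel whose heavy deletions were deferred. Fields are
// illustrative only.
type PendingCleanupInfo struct {
	// ChanPoint identifies the channel whose data still has to be
	// deleted at startup.
	ChanPoint wire.OutPoint
}

// Encode serializes the cleanup info to w.
func (p *PendingCleanupInfo) Encode(w io.Writer) error {
	return WriteElements(w, p.ChanPoint)
}

// Decode deserializes the cleanup info from r.
func (p *PendingCleanupInfo) Decode(r io.Reader) error {
	return ReadElements(r, &p.ChanPoint)
}
```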
Add a CleanupPendingCloses() function to process channels that were closed but whose heavy cleanup was deferred to startup. This is used by postgres backends to avoid lock contention during normal operation.

The function:
- Reads all entries from pendingCleanupBucket
- For each entry, performs the deferred cleanup operations:
  - Wipes forwarding packages
  - Deletes channel metadata
  - Removes the thaw height if applicable
  - Deletes revocation logs
  - Removes the channel bucket
- Removes the processed entry from pendingCleanupBucket

For non-postgres backends this function is a no-op, as the bucket will be empty (they perform immediate cleanup).
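In skeleton form, the startup cleanup could be structured roughly as follows. The transaction plumbing and receiver are assumptions; the per-entry steps mirror the commit message and are only described in comments here.

```go
// CleanupPendingCloses walks pendingCleanupBucket and finishes the
// deletions that were deferred at close time. Sketch only.
func (d *DB) CleanupPendingCloses() error {
	return kvdb.Update(d, func(tx kvdb.RwTx) error {
		cleanupBucket := tx.ReadWriteBucket(pendingCleanupBucket)
		if cleanupBucket == nil {
			// Nothing was deferred (bbolt/sqlite, or no
			// channels closed since the last cleanup).
			return nil
		}

		// Gather keys first so the bucket is not mutated while
		// it is being iterated.
		var keys [][]byte
		err := cleanupBucket.ForEach(func(k, _ []byte) error {
			keys = append(keys, append([]byte(nil), k...))
			return nil
		})
		if err != nil {
			return err
		}

		for _, k := range keys {
			// The deferred work (wiping forwarding packages,
			// deleting channel metadata and revocation logs,
			// removing the channel bucket) would happen here.
			// This sketch only removes the processed entry.
			if err := cleanupBucket.Delete(k); err != nil {
				return err
			}
		}

		return nil
	}, func() {})
}
```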
Call CleanupPendingCloses() during LND startup to process any channels that had their heavy cleanup deferred. This is only done for postgres backends (checked via kvdb.ShouldDeferHeavyOperations). For bbolt and sqlite backends, this check is skipped as they perform immediate cleanup during CloseChannel.
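The startup hook itself is small; roughly like the sketch below. Where exactly lnd invokes it, and every name other than CleanupPendingCloses and ShouldDeferHeavyOperations, is an assumption.

```go
// During server startup, after the channel DB is opened (sketch).
if kvdb.ShouldDeferHeavyOperations(chanDB.Backend) {
	if err := chanDB.CleanupPendingCloses(); err != nil {
		return fmt.Errorf("unable to clean up pending "+
			"closes: %w", err)
	}
}
```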
Add unit tests for the deferred channel cleanup functionality:
- TestPendingCleanupInfoEncodeDecode: tests that PendingCleanupInfo can be properly encoded and decoded
- TestCleanupPendingClosesEmpty: tests that CleanupPendingCloses works correctly when there are no pending cleanups
- TestImmediateCleanupOnClose: tests that for non-postgres backends, channel close performs immediate cleanup without deferring
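As an illustration of the first of these, an encode/decode round-trip test could look like the sketch below; the field names match the illustrative struct shown earlier, not necessarily the PR's.

```go
func TestPendingCleanupInfoEncodeDecode(t *testing.T) {
	t.Parallel()

	info := PendingCleanupInfo{
		ChanPoint: wire.OutPoint{Index: 9},
	}

	// Round-trip through the codec and make sure nothing is lost.
	var b bytes.Buffer
	require.NoError(t, info.Encode(&b))

	var decoded PendingCleanupInfo
	require.NoError(t, decoded.Decode(&b))
	require.Equal(t, info, decoded)
}
```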
Split the wipe-forwarding-packages integration test into backend-specific versions to account for the different cleanup behaviors:
- For bbolt/sqlite (immediate cleanup):
  - The test remains in list_on_bbolt_test.go with build tag !kvdb_postgres
  - Forwarding packages are deleted immediately during CloseChannel
  - No node restart is needed to verify cleanup
- For postgres (deferred cleanup):
  - A new test in lnd_wipe_fwdpkgs_sql_test.go with build tag kvdb_postgres
  - Forwarding packages are NOT deleted during CloseChannel
  - A node restart triggers CleanupPendingCloses(), which performs the cleanup
  - The test verifies cleanup after the restart

Also updates list_exclude_test.go to include both test variants in the Windows exclusion list.
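The split relies on ordinary Go build tags; a file header for the postgres variant would look roughly like this (package name and comment are illustrative).

```go
//go:build kvdb_postgres

// lnd_wipe_fwdpkgs_sql_test.go (sketch): only compiled when the
// postgres kvdb backend is selected. The test restarts the node so
// CleanupPendingCloses runs, then checks the forwarding packages
// are gone.
package itest
```

The bbolt/sqlite variant keeps the inverse constraint, //go:build !kvdb_postgres, and asserts immediate deletion without a restart.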
Update tests that close channels and immediately check for deletion to handle the deferred cleanup behavior for postgres backends:
- TestAbandonChannel
- TestOpenChannelPutGetDelete
- TestChannelStateTransition

For postgres backends (where ShouldDeferHeavyOperations returns true), call CleanupPendingCloses() after closing channels to process the deferred cleanup before asserting that channels are deleted.
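The adjustment in those tests follows a small common pattern, sketched here; cdb stands in for the test's channeldb handle, and the exact variable names differ per test.

```go
// After closing the channel in the test: if this backend defers
// heavy operations (postgres), flush the deferred cleanup before
// asserting that the channel data is gone.
if kvdb.ShouldDeferHeavyOperations(cdb.Backend) {
	require.NoError(t, cdb.CleanupPendingCloses())
}
```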
Force-pushed from 791ad11 to bf85a3d
Summary
This PR defers the deletion of channel data (revocation logs, forwarding packages,
channel bucket) from CloseChannel to node startup via CleanupPendingCloses.
Problem: With the kvdb SQL backend (postgres/sqlite), closing a channel with many
states triggers deletion of thousands of revocation log entries in a single
transaction. Because the SQL implementation packs all KV entries into one table,
these bulk deletes are extremely slow and can cause timeout errors (unable to mark
channel closed: timeout: context deadline exceeded). They also lock up the
database while other lightning operations are trying to run.
Solution: Instead of deleting immediately during CloseChannel, we:
- Mark the channel as pending cleanup (ChanStatusPendingCleanup) and store its cleanup metadata in pendingCleanupBucket
- Perform the actual deletions at node startup via CleanupPendingCloses, before lightning operations are active
For bbolt, the original design remains acceptable due to its different performance
characteristics. Once native SQL support lands, this workaround should no
longer be necessary.