
Conversation


@ziggie1984 ziggie1984 commented Nov 22, 2025

Summary

This PR defers the deletion of channel data (revocation logs, forwarding packages,
channel bucket) from CloseChannel to node startup via CleanupPendingCloses.

Problem: With the kvdb SQL backend (postgres/sqlite), closing a channel with many
states triggers the deletion of thousands of revocation log entries in a single
transaction. Because the SQL implementation packs all KV entries into one table,
these bulk deletes are extremely slow and can cause timeout errors such as
"unable to mark channel closed: timeout: context deadline exceeded". These large
transactions also lock up the database while other lightning operations are
trying to run.

Solution: Instead of deleting immediately during CloseChannel, we:

  1. Mark the channel with ChanStatusPendingCleanup
  2. Store cleanup metadata in a dedicated bucket
  3. Filter these channels from all fetch operations
  4. Perform the actual deletions at startup via CleanupPendingCloses when no
    lightning operations are active
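
A hedged sketch of this branch inside CloseChannel's database transaction.
kvdb.ShouldDeferHeavyOperations, storePendingCleanup and performImmediateCleanup
are the names used in the commits below; markStatusPendingCleanup is a
hypothetical stand-in for how step 1 is applied:

// Inside CloseChannel's kvdb transaction (sketch, not the PR's exact code).
if kvdb.ShouldDeferHeavyOperations(db.Backend) {
	// Steps 1-2: flag the channel and persist the cleanup metadata
	// so the heavy deletes can run at startup instead.
	if err := markStatusPendingCleanup(tx, channel); err != nil {
		return err
	}
	if err := storePendingCleanup(tx, channel); err != nil {
		return err
	}
} else {
	// bbolt/sqlite: delete revocation logs, forwarding packages and
	// the channel bucket inline, as before.
	if err := performImmediateCleanup(tx, channel); err != nil {
		return err
	}
}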

For bbolt, the original inline deletion is acceptable due to its different
performance characteristics. Once lnd has native SQL support, this workaround
should no longer be necessary.

@ziggie1984 ziggie1984 changed the title Deffer Channel Cleanup after a channel is closed to avoid kv-sql stress Defer Channel Cleanup after a channel is closed to avoid kv-sql stress Nov 22, 2025
@ziggie1984 ziggie1984 self-assigned this Nov 22, 2025
@ziggie1984 ziggie1984 force-pushed the deferred-channel-cleanup branch from 41fdc4b to 7bf68fa on November 22, 2025 23:51
@ziggie1984 (Collaborator, Author)

/gemini review

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a deferred cleanup mechanism for closed channels to alleviate stress on SQL-based database backends. Instead of performing expensive deletions synchronously during CloseChannel, channels are now marked as ChanStatusPendingCleanup and their cleanup metadata is stored. The actual deletion of channel data is deferred to node startup via a new CleanupPendingCloses function. This is a well-thought-out solution to the database contention problem. The changes are well-tested with a new unit test covering the deferred cleanup lifecycle. I've found a potential race condition when creating a top-level bucket and a minor inefficiency in the cleanup process, which I've commented on. Overall, this is a great improvement.

Comment on lines 3978 to 3982
cleanupBucket, err := tx.CreateTopLevelBucket(
	pendingCleanupBucket,
)
if err != nil {
	return err


Severity: high

The call to tx.CreateTopLevelBucket will return an error if the bucket already exists. If multiple channels are closed concurrently, this could lead to failures for all but the first one because they will all try to create the same top-level bucket. You should handle the kvdb.ErrBucketExists error gracefully.
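
A minimal sketch of the suggested handling, assuming CreateTopLevelBucket can
fail with kvdb.ErrBucketExists and that the transaction exposes ReadWriteBucket:

cleanupBucket, err := tx.CreateTopLevelBucket(pendingCleanupBucket)
switch {
case errors.Is(err, kvdb.ErrBucketExists):
	// Another concurrent close already created the bucket; reuse it.
	cleanupBucket = tx.ReadWriteBucket(pendingCleanupBucket)

case err != nil:
	return err
}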

channeldb/db.go Outdated
Comment on lines 1286 to 1301
// First, count how many channels need cleanup.
var numChannels int
err := cleanupBucket.ForEach(func(_, _ []byte) error {
	numChannels++
	return nil
})
if err != nil {
	return err
}

if numChannels == 0 {
	return nil
}

log.Infof("Starting cleanup of %d closed channel(s) - this "+
	"may take a while", numChannels)


Severity: medium

This function iterates over the cleanupBucket twice: once to count the number of channels for logging, and a second time to perform the cleanup. This could be inefficient during startup if there are many channels to clean up. To optimize, consider performing a single pass. For example, you could collect all keys and values into slices first, then log the count and process them. This would trade a small amount of memory for a potentially significant improvement in startup time.
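
A single-pass version might look like the following sketch; the key/value
slices must be copied because they are only valid inside the ForEach callback:

// Collect entries in a single pass, copying keys and values out of
// the callback.
type pendingEntry struct {
	key, val []byte
}

var entries []pendingEntry
err := cleanupBucket.ForEach(func(k, v []byte) error {
	entries = append(entries, pendingEntry{
		key: append([]byte(nil), k...),
		val: append([]byte(nil), v...),
	})
	return nil
})
if err != nil {
	return err
}

if len(entries) == 0 {
	return nil
}

log.Infof("Starting cleanup of %d closed channel(s) - this may "+
	"take a while", len(entries))

// ...process entries...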

@ziggie1984 ziggie1984 force-pushed the deferred-channel-cleanup branch 2 times, most recently from 4927490 to 791ad11 on November 23, 2025 19:16
Add a DeferrableBackend interface that backends can optionally implement
to indicate they prefer deferring heavy operations (like bulk deletes of
revocation logs) to startup rather than executing them inline.

This is needed for postgres backends where concurrent large transactions
can cause lock contention and timeouts during normal operation.

The implementation:
- Adds DeferrableBackend interface with ShouldDeferHeavyOperations()
- Adds helper function to check if a backend implements the interface
- Implements the interface on sqlbase.db, returning true only for postgres
  (identified by "pgx" driver) and false for sqlite

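The interface and helper can be as small as the following sketch; the package
placement and the sqlbase config field name are assumptions:

// DeferrableBackend is optionally implemented by database backends
// that prefer deferring heavy operations (like bulk deletes of
// revocation logs) to startup.
type DeferrableBackend interface {
	ShouldDeferHeavyOperations() bool
}

// ShouldDeferHeavyOperations reports whether the backend opts into
// deferral.
func ShouldDeferHeavyOperations(db Backend) bool {
	d, ok := db.(DeferrableBackend)
	return ok && d.ShouldDeferHeavyOperations()
}

// On sqlbase.db: true only for postgres, identified by the pgx
// driver (config field name assumed).
func (db *db) ShouldDeferHeavyOperations() bool {
	return db.cfg.DriverName == "pgx"
}
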
Run go mod tidy.

For postgres backends, defer heavy cleanup operations (deleting revocation
logs, forwarding packages) to startup rather than performing them inline
during CloseChannel. This avoids lock contention and transaction timeouts
that can occur when multiple channels are closed concurrently.

Changes:
- Add pendingCleanupBucket to store channels awaiting cleanup
- Add PendingCleanupInfo struct with Encode/Decode methods
- Modify CloseChannel to check kvdb.ShouldDeferHeavyOperations():
  - For postgres: store cleanup info and defer deletion to startup
  - For bbolt/sqlite: perform immediate cleanup as before
- Add storePendingCleanup() and performImmediateCleanup() helpers
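
The PendingCleanupInfo serialization might look like this sketch; the real
field set is defined by the PR, here assumed to be just the channel point:

import (
	"encoding/binary"
	"io"

	"github.com/btcsuite/btcd/wire"
)

// PendingCleanupInfo records what CleanupPendingCloses needs at
// startup to locate the channel's data.
type PendingCleanupInfo struct {
	ChanPoint wire.OutPoint
}

// Encode writes the cleanup info in a fixed binary format.
func (p *PendingCleanupInfo) Encode(w io.Writer) error {
	if _, err := w.Write(p.ChanPoint.Hash[:]); err != nil {
		return err
	}
	return binary.Write(w, binary.BigEndian, p.ChanPoint.Index)
}

// Decode reads the cleanup info written by Encode.
func (p *PendingCleanupInfo) Decode(r io.Reader) error {
	if _, err := io.ReadFull(r, p.ChanPoint.Hash[:]); err != nil {
		return err
	}
	return binary.Read(r, binary.BigEndian, &p.ChanPoint.Index)
}
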
Add CleanupPendingCloses() function to process channels that were closed
but whose heavy cleanup was deferred to startup. This is used by postgres
backends to avoid lock contention during normal operation.

The function:
- Reads all entries from pendingCleanupBucket
- For each entry, performs the deferred cleanup operations:
  - Wipes forwarding packages
  - Deletes channel metadata
  - Removes thaw height if applicable
  - Deletes revocation logs
  - Removes the channel bucket
- Removes the processed entry from pendingCleanupBucket

For non-postgres backends, this function is a no-op as the bucket
will be empty (they perform immediate cleanup).
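
A sketch of the startup pass, collecting keys first so nothing is deleted while
iterating; deleteChannelData is a hypothetical stand-in for the per-channel
steps listed above, and the bytes/kvdb imports are elided:

func (d *DB) CleanupPendingCloses() error {
	return kvdb.Update(d, func(tx kvdb.RwTx) error {
		bucket := tx.ReadWriteBucket(pendingCleanupBucket)
		if bucket == nil {
			// Nothing was deferred (e.g. bbolt/sqlite).
			return nil
		}

		// Collect keys first: deleting while iterating is unsafe.
		var keys [][]byte
		err := bucket.ForEach(func(k, _ []byte) error {
			keys = append(keys, append([]byte(nil), k...))
			return nil
		})
		if err != nil {
			return err
		}

		for _, k := range keys {
			var info PendingCleanupInfo
			r := bytes.NewReader(bucket.Get(k))
			if err := info.Decode(r); err != nil {
				return err
			}

			// Wipe forwarding packages, revocation logs,
			// the channel bucket, etc.
			if err := deleteChannelData(tx, &info); err != nil {
				return err
			}

			if err := bucket.Delete(k); err != nil {
				return err
			}
		}

		return nil
	}, func() {})
}
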
Call CleanupPendingCloses() during LND startup to process any channels
that had their heavy cleanup deferred. This is only done for postgres
backends (checked via kvdb.ShouldDeferHeavyOperations).

For bbolt and sqlite backends, this check is skipped as they perform
immediate cleanup during CloseChannel.
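
The startup hook might look like this sketch; placement and variable names are
assumptions:

// During lnd startup, before normal channel operations begin.
if kvdb.ShouldDeferHeavyOperations(dbs.ChanStateDB.Backend) {
	if err := dbs.ChanStateDB.CleanupPendingCloses(); err != nil {
		return err
	}
}
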
Add unit tests for the deferred channel cleanup functionality:

- TestPendingCleanupInfoEncodeDecode: Tests that PendingCleanupInfo
  can be properly encoded and decoded
- TestCleanupPendingClosesEmpty: Tests that CleanupPendingCloses
  works correctly when there are no pending cleanups
- TestImmediateCleanupOnClose: Tests that for non-postgres backends,
  channel close performs immediate cleanup without deferring
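
For example, the encode/decode test could be a simple round-trip; this sketch
uses the hypothetical struct above (bytes/testing/require/wire imports elided):

func TestPendingCleanupInfoEncodeDecode(t *testing.T) {
	in := PendingCleanupInfo{
		ChanPoint: wire.OutPoint{Index: 7},
	}

	var buf bytes.Buffer
	require.NoError(t, in.Encode(&buf))

	var out PendingCleanupInfo
	require.NoError(t, out.Decode(&buf))
	require.Equal(t, in, out)
}
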
Split the wipe forwarding packages integration test into backend-specific
versions to account for the different cleanup behaviors:

- For bbolt/sqlite (immediate cleanup):
  - Test remains in list_on_bbolt_test.go with build tag !kvdb_postgres
  - Forwarding packages are deleted immediately during CloseChannel
  - No node restart needed to verify cleanup

- For postgres (deferred cleanup):
  - New test in lnd_wipe_fwdpkgs_sql_test.go with build tag kvdb_postgres
  - Forwarding packages are NOT deleted during CloseChannel
  - Node restart triggers CleanupPendingCloses() which performs cleanup
  - Test verifies cleanup after restart

Also updates list_exclude_test.go to include both test variants in
the Windows exclusion list.
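
The build-tag split keeps each variant compiled only against the matching
backend, e.g.:

// In lnd_wipe_fwdpkgs_sql_test.go:
//go:build kvdb_postgres

// In the bbolt/sqlite variant:
//go:build !kvdb_postgres
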
Update tests that close channels and immediately check for deletion to
handle the deferred cleanup behavior for postgres backends:

- TestAbandonChannel
- TestOpenChannelPutGetDelete
- TestChannelStateTransition

For postgres backends (where ShouldDeferHeavyOperations returns true),
call CleanupPendingCloses() after closing channels to process the
deferred cleanup before asserting that channels are deleted.
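
The updated tests follow a simple pattern (sketch; exact call sites differ per
test):

// Close the channel as before.
require.NoError(t, channel.CloseChannel(summary))

// For postgres, the heavy deletion was deferred; run it now so the
// assertions below see the final state.
if kvdb.ShouldDeferHeavyOperations(cdb.Backend) {
	require.NoError(t, cdb.CleanupPendingCloses())
}

// Now assert that the channel data is gone.
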
@ziggie1984 ziggie1984 force-pushed the deferred-channel-cleanup branch from 791ad11 to bf85a3d on November 23, 2025 19:49
