Defer Channel Cleanup after a channel is closed to avoid kv-sql stress #10390
Conversation
Force-pushed from 41fdc4b to 7bf68fa
/gemini review
Code Review
This pull request introduces a deferred cleanup mechanism for closed channels to alleviate stress on SQL-based database backends. Instead of performing expensive deletions synchronously during CloseChannel, channels are now marked as ChanStatusPendingCleanup and their cleanup metadata is stored. The actual deletion of channel data is deferred to node startup via a new CleanupPendingCloses function. This is a well-thought-out solution to the database contention problem. The changes are well-tested with a new unit test covering the deferred cleanup lifecycle. I've found a potential race condition when creating a top-level bucket and a minor inefficiency in the cleanup process, which I've commented on. Overall, this is a great improvement.
channeldb/channel.go
Outdated
    cleanupBucket, err := tx.CreateTopLevelBucket(
        pendingCleanupBucket,
    )
    if err != nil {
        return err
The call to tx.CreateTopLevelBucket will return an error if the bucket already exists. If multiple channels are closed concurrently, this could lead to failures for all but the first one because they will all try to create the same top-level bucket. You should handle the kvdb.ErrBucketExists error gracefully.
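One way to handle this gracefully is sketched below. This is only an illustration, assuming kvdb exposes the ErrBucketExists sentinel mentioned above and that tx.ReadWriteBucket can open an existing top-level bucket; it is not the PR's actual code.

```go
// Sketch only: treat "bucket already exists" as success and reuse
// the existing bucket instead of failing the whole transaction.
cleanupBucket, err := tx.CreateTopLevelBucket(pendingCleanupBucket)
switch {
case err == nil:
	// Freshly created, nothing else to do.

case errors.Is(err, kvdb.ErrBucketExists):
	// A concurrent close already created the bucket; open the
	// existing one instead.
	cleanupBucket = tx.ReadWriteBucket(pendingCleanupBucket)

default:
	return err
}
```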
channeldb/db.go
Outdated
    // First, count how many channels need cleanup.
    var numChannels int
    err := cleanupBucket.ForEach(func(_, _ []byte) error {
        numChannels++
        return nil
    })
    if err != nil {
        return err
    }

    if numChannels == 0 {
        return nil
    }

    log.Infof("Starting cleanup of %d closed channel(s) - this "+
        "may take a while", numChannels)
This function iterates over the cleanupBucket twice: once to count the number of channels for logging, and a second time to perform the cleanup. This could be inefficient during startup if there are many channels to clean up. To optimize, consider performing a single pass. For example, you could collect all keys and values into slices first, then log the count and process them. This would trade a small amount of memory for a potentially significant improvement in startup time.
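A minimal single-pass sketch of that suggestion follows; the pendingEntry type and the processEntry helper are illustrative names introduced here, not code from the PR.

```go
// Sketch: collect the entries in one pass, log the count, then
// process them from memory. Trades a little memory for one less
// full iteration of the bucket.
type pendingEntry struct {
	key, value []byte
}

var pending []pendingEntry
err := cleanupBucket.ForEach(func(k, v []byte) error {
	// Copy k and v so they remain valid outside the iteration.
	pending = append(pending, pendingEntry{
		key:   append([]byte(nil), k...),
		value: append([]byte(nil), v...),
	})
	return nil
})
if err != nil {
	return err
}

if len(pending) == 0 {
	return nil
}

log.Infof("Starting cleanup of %d closed channel(s) - this "+
	"may take a while", len(pending))

for _, entry := range pending {
	// processEntry stands in for the existing per-channel cleanup
	// logic (wipe forwarding packages, delete revocation logs,
	// remove the channel bucket, drop the entry).
	if err := processEntry(entry.key, entry.value); err != nil {
		return err
	}
}
```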
Force-pushed from 4927490 to 791ad11
Add a DeferrableBackend interface that backends can optionally implement to indicate they prefer deferring heavy operations (like bulk deletes of revocation logs) to startup rather than executing them inline. This is needed for postgres backends, where concurrent large transactions can cause lock contention and timeouts during normal operation.

The implementation:
- Adds the DeferrableBackend interface with ShouldDeferHeavyOperations()
- Adds a helper function to check whether a backend implements the interface
- Implements the interface on sqlbase.db, returning true only for postgres (identified by the "pgx" driver) and false for sqlite

Also runs go mod tidy.
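A rough sketch of what the interface and helper could look like is shown below. The names DeferrableBackend and ShouldDeferHeavyOperations come from the commit message; everything else (the Backend parameter type, the package placement) is an assumption for illustration.

```go
// DeferrableBackend is an optional interface a kvdb backend can
// implement to signal that heavy operations (e.g. bulk deletes of
// revocation logs) should be deferred to startup.
type DeferrableBackend interface {
	// ShouldDeferHeavyOperations reports whether heavy operations
	// should be deferred rather than executed inline.
	ShouldDeferHeavyOperations() bool
}

// ShouldDeferHeavyOperations returns true if the backend opts in to
// deferring heavy operations.
func ShouldDeferHeavyOperations(db Backend) bool {
	d, ok := db.(DeferrableBackend)
	return ok && d.ShouldDeferHeavyOperations()
}
```

Per the commit message, the sqlbase implementation of this method returns true only when the backend is postgres (identified by the "pgx" driver) and false for sqlite.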
For postgres backends, defer heavy cleanup operations (deleting revocation logs, forwarding packages) to startup rather than performing them inline during CloseChannel. This avoids lock contention and transaction timeouts that can occur when multiple channels are closed concurrently.

Changes:
- Add pendingCleanupBucket to store channels awaiting cleanup
- Add a PendingCleanupInfo struct with Encode/Decode methods
- Modify CloseChannel to check kvdb.ShouldDeferHeavyOperations():
  - For postgres: store cleanup info and defer deletion to startup
  - For bbolt/sqlite: perform immediate cleanup as before
- Add storePendingCleanup() and performImmediateCleanup() helpers
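The cleanup record might look something like the following sketch. The real field layout and serialization in the PR are not shown here; the fields are illustrative, and WriteElements/ReadElements are channeldb's existing codec helpers.

```go
// PendingCleanupInfo is the metadata stored under pendingCleanupBucket
// for a channel whose heavy deletions were deferred. Fields are
// illustrative only.
type PendingCleanupInfo struct {
	// ChanPoint identifies the channel whose data still has to be
	// deleted at startup.
	ChanPoint wire.OutPoint
}

// Encode serializes the cleanup info to w.
func (p *PendingCleanupInfo) Encode(w io.Writer) error {
	return WriteElements(w, p.ChanPoint)
}

// Decode deserializes the cleanup info from r.
func (p *PendingCleanupInfo) Decode(r io.Reader) error {
	return ReadElements(r, &p.ChanPoint)
}
```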
Add a CleanupPendingCloses() function to process channels that were closed but whose heavy cleanup was deferred to startup. This is used by postgres backends to avoid lock contention during normal operation.

The function:
- Reads all entries from pendingCleanupBucket
- For each entry, performs the deferred cleanup operations:
  - Wipes forwarding packages
  - Deletes channel metadata
  - Removes the thaw height if applicable
  - Deletes revocation logs
  - Removes the channel bucket
- Removes the processed entry from pendingCleanupBucket

For non-postgres backends this function is a no-op, as the bucket will be empty (they perform immediate cleanup).
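In skeleton form, the startup cleanup could be structured roughly as follows. The transaction plumbing and receiver are assumptions; the per-entry steps mirror the commit message and are only described in comments here.

```go
// CleanupPendingCloses walks pendingCleanupBucket and finishes the
// deletions that were deferred at close time. Sketch only.
func (d *DB) CleanupPendingCloses() error {
	return kvdb.Update(d, func(tx kvdb.RwTx) error {
		cleanupBucket := tx.ReadWriteBucket(pendingCleanupBucket)
		if cleanupBucket == nil {
			// Nothing was deferred (bbolt/sqlite, or no
			// channels closed since the last cleanup).
			return nil
		}

		// Gather keys first so the bucket is not mutated while
		// it is being iterated.
		var keys [][]byte
		err := cleanupBucket.ForEach(func(k, _ []byte) error {
			keys = append(keys, append([]byte(nil), k...))
			return nil
		})
		if err != nil {
			return err
		}

		for _, k := range keys {
			// The deferred work (wiping forwarding packages,
			// deleting channel metadata and revocation logs,
			// removing the channel bucket) would happen here.
			// This sketch only removes the processed entry.
			if err := cleanupBucket.Delete(k); err != nil {
				return err
			}
		}

		return nil
	}, func() {})
}
```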
Call CleanupPendingCloses() during LND startup to process any channels that had their heavy cleanup deferred. This is only done for postgres backends (checked via kvdb.ShouldDeferHeavyOperations). For bbolt and sqlite backends, this check is skipped as they perform immediate cleanup during CloseChannel.
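The startup hook itself is small; roughly like the sketch below. Where exactly lnd invokes it, and every name other than CleanupPendingCloses and ShouldDeferHeavyOperations, is an assumption.

```go
// During server startup, after the channel DB is opened (sketch).
if kvdb.ShouldDeferHeavyOperations(chanDB.Backend) {
	if err := chanDB.CleanupPendingCloses(); err != nil {
		return fmt.Errorf("unable to clean up pending "+
			"closes: %w", err)
	}
}
```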
Add unit tests for the deferred channel cleanup functionality:
- TestPendingCleanupInfoEncodeDecode: tests that PendingCleanupInfo can be properly encoded and decoded
- TestCleanupPendingClosesEmpty: tests that CleanupPendingCloses works correctly when there are no pending cleanups
- TestImmediateCleanupOnClose: tests that for non-postgres backends, channel close performs immediate cleanup without deferring
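As an illustration of the first of these, an encode/decode round-trip test could look like the sketch below; the field names match the illustrative struct shown earlier, not necessarily the PR's.

```go
func TestPendingCleanupInfoEncodeDecode(t *testing.T) {
	t.Parallel()

	info := PendingCleanupInfo{
		ChanPoint: wire.OutPoint{Index: 9},
	}

	// Round-trip through the codec and make sure nothing is lost.
	var b bytes.Buffer
	require.NoError(t, info.Encode(&b))

	var decoded PendingCleanupInfo
	require.NoError(t, decoded.Decode(&b))
	require.Equal(t, info, decoded)
}
```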
Split the wipe-forwarding-packages integration test into backend-specific versions to account for the different cleanup behaviors:
- For bbolt/sqlite (immediate cleanup):
  - The test remains in list_on_bbolt_test.go with build tag !kvdb_postgres
  - Forwarding packages are deleted immediately during CloseChannel
  - No node restart is needed to verify cleanup
- For postgres (deferred cleanup):
  - A new test in lnd_wipe_fwdpkgs_sql_test.go with build tag kvdb_postgres
  - Forwarding packages are NOT deleted during CloseChannel
  - A node restart triggers CleanupPendingCloses(), which performs the cleanup
  - The test verifies cleanup after the restart

Also updates list_exclude_test.go to include both test variants in the Windows exclusion list.
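The split relies on ordinary Go build tags; a file header for the postgres variant would look roughly like this (package name and comment are illustrative).

```go
//go:build kvdb_postgres

// lnd_wipe_fwdpkgs_sql_test.go (sketch): only compiled when the
// postgres kvdb backend is selected. The test restarts the node so
// CleanupPendingCloses runs, then checks the forwarding packages
// are gone.
package itest
```

The bbolt/sqlite variant keeps the inverse constraint, //go:build !kvdb_postgres, and asserts immediate deletion without a restart.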
Update tests that close channels and immediately check for deletion to handle the deferred cleanup behavior for postgres backends:
- TestAbandonChannel
- TestOpenChannelPutGetDelete
- TestChannelStateTransition

For postgres backends (where ShouldDeferHeavyOperations returns true), call CleanupPendingCloses() after closing channels to process the deferred cleanup before asserting that channels are deleted.
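The adjustment in those tests follows a small common pattern, sketched here; cdb stands in for the test's channeldb handle, and the exact variable names differ per test.

```go
// After closing the channel in the test: if this backend defers
// heavy operations (postgres), flush the deferred cleanup before
// asserting that the channel data is gone.
if kvdb.ShouldDeferHeavyOperations(cdb.Backend) {
	require.NoError(t, cdb.CleanupPendingCloses())
}
```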
Force-pushed from 791ad11 to bf85a3d
Summary
This PR defers the deletion of channel data (revocation logs, forwarding packages,
channel bucket) from CloseChannel to node startup via CleanupPendingCloses.
Problem: With the kvdb SQL backend (postgres/sqlite), closing a channel with many
states triggers deletion of thousands of revocation log entries in a single
transaction. Because the SQL implementation packs all KV entries into one table,
these bulk deletes are extremely slow and can cause timeout errors (unable to mark
channel closed: timeout: context deadline exceeded). They also lock up the
database while other lightning operations are trying to run.
Solution: Instead of deleting immediately during CloseChannel, we:
- Mark the channel as pending cleanup (ChanStatusPendingCleanup) and store its cleanup metadata in pendingCleanupBucket
- Perform the actual deletions at node startup via CleanupPendingCloses, before lightning operations are active
For bbolt, the original design remains acceptable due to its different performance
characteristics. Once native SQL support lands, this workaround should no
longer be necessary.