channeldb: add reject and channel caches #2847
Conversation
Force-pushed from 3a4ed58 to c131a21
Force-pushed from c131a21 to 4aab755
Now that the dependent PR has been merged, this can be rebased!
Yay caching! I've been running this on a few of my mainnet nodes and have noticed that combined with the gossip sync PR, it pretty much does away with the large memory burst a node can see today on restart.
No major comments, other than pointing out a future direction that we may want to consider, where we update the cache with a new entry rather than removing the entry from the cache. Eventually, we can also extend this channel cache to be used in things like path finding or computing DescribeGraph, etc.
channeldb/reject_cache.go
Outdated
// rejectFlagExists is a flag indicating whether the channel exists,
// i.e. the channel is open and has a recent channel update. If this
// flag is not set, the channel is either a zombie or unknown.
rejectFlagExists = 1 << 0
Seems like a safe opportunity to use iota, and also declare these to be typed as rejectFlags rather than being a uint8.
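A minimal sketch of that suggestion, building on the snippet above (rejectFlagZombie and packRejectFlags are assumptions for the second flag and the packing helper described later in this PR, not necessarily the exact code):

```go
package channeldb

// rejectFlags is a compact bitfield the reject cache uses to track a
// channel's status without touching disk.
type rejectFlags uint8

const (
	// rejectFlagExists indicates that the channel exists, i.e. it is
	// open and has a recent channel update. If this flag is not set,
	// the channel is either a zombie or unknown.
	rejectFlagExists rejectFlags = 1 << iota

	// rejectFlagZombie indicates that the channel has been marked as a
	// zombie.
	rejectFlagZombie
)

// packRejectFlags combines the exists and isZombie booleans into a single
// rejectFlags value.
func packRejectFlags(exists, isZombie bool) rejectFlags {
	var flags rejectFlags
	if exists {
		flags |= rejectFlagExists
	}
	if isZombie {
		flags |= rejectFlagZombie
	}
	return flags
}
```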
channeldb/graph.go
Outdated
	return err
}

c.rejectCache.remove(edge.ChannelID)
Alternatively, we can insert the new copy of the edge policy into the cache. I see no issue in delaying this to a distinct change, though, once we see how this fares in the wild after most of the network has updated.
There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled.
This won't be too bad since we'll only request all the channel_ids we don't know of that the remote peer does. After this is done once, the cost of a historical sync should be pretty negligible as long as we don't go offline for a long period of time.
Force-pushed from 4aab755 to a9da15c
LGTM 🍭
Have been testing this out on my node over the past few days, and it's significantly helped with the initial allocation burst when connecting to peers for the first time. I anticipate we'll also get a lot of feedback from node operators during the RC cycle, which can be used to modify the cache write policies or default sizes. Needs a rebase to remove the Darwin travis timing commits!
🔥
})
if err != nil {
	return err
}

c.rejectCache.remove(edge.ChannelID)
c.chanCache.remove(edge.ChannelID)
if entry, ok := c.rejectCache.get(edge.ChannelID); ok {
Better to just remove IMO, we have no guarantees that the updated policy doesn't change the value of the rejectFlags.
The reject flags are only changed w.r.t. channel existence/zombie pruning. Updating the edge policy shouldn't modify rejectFlags.
Force-pushed from a9da15c to 0a0cd51
This reverts commit 3555d3d.
…use Batch" This reverts commit e8da6dd.
This reverts commit da76c34.
This commit introduces the Validator interface, which is intended to be implemented by any sub configs. It specifies a Validate() error method that should fail if a sub configuration contains any invalid or insane parameters. In addition, a package-level Validate method can be used to check a variadic number of sub configs implementing the Validator interface. This allows the primary config struct to be extended via targeted and/or specialized sub configs, and validate all of them in sequence without bloating the main package with the actual validation logic.
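A rough sketch of the interface described in that commit (the lncfg package name is inferred from the lncfg.Caches subconfig mentioned below; the exact code may differ):

```go
package lncfg

// Validator is implemented by any sub config that knows how to check its
// own parameters for invalid or insane values.
type Validator interface {
	// Validate returns a non-nil error if the sub config contains any
	// invalid or insane parameters.
	Validate() error
}

// Validate checks a variadic number of sub configs implementing the
// Validator interface, returning the first validation error encountered.
// This lets the primary config struct validate all of its sub configs in
// sequence without housing the validation logic itself.
func Validate(validators ...Validator) error {
	for _, v := range validators {
		if err := v.Validate(); err != nil {
			return err
		}
	}
	return nil
}
```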
Force-pushed from 0a0cd51 to 5d98a94
LGTM 🛰
This PR adds two caches housed within the channeldb to optimize two existing hot spots related to gossip traffic.
Reject Cache
The first is dubbed a reject cache, whose entries contain a small amount of information critical to determining if we should spend resources validating a particular channel announcement or channel update. This improves the performance of HasChannelEdge, which is a subroutine of KnownEdge and IsStaleEdgePolicy. Each entry in the reject cache stores the following:
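(The entry definition did not survive here; the sketch below reconstructs it from the description that follows, so the exact field names should be treated as illustrative.)

```go
// rejectCacheEntry caches the minimal information needed to decide whether
// a channel announcement or channel update is worth validating.
type rejectCacheEntry struct {
	// upd1Time and upd2Time are the unix timestamps of the most recent
	// channel update seen for each direction of the channel.
	upd1Time int64
	upd2Time int64

	// flags packs the exists and isZombie booleans into a single byte.
	flags rejectFlags
}
```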
where flags is a packed bitfield containing, for now, the exists and isZombie booleans. We store the time as a unix integer as opposed to the time.Time directly, since time.Time's internal pointer for storing timezone info would force the garbage collector to traverse these long-lived entries. They are subsequently reconstructed during calls to HasChannelEdge using the integral value.

The addition of the reject cache greatly improves LND's ability to efficiently filter gossip traffic, and results in a significantly lower number of database accesses for this purpose. Users should notice the gossip syncers terminating much quicker (due to the absence of db hits) and overall snappier connections.
Channel Cache
The second is a channel cache, which caches ChannelEdge values and is used to reduce memory allocations stemming from ChanUpdatesInHorizon. A ChannelEdge has the structure shown in the sketch at the end of this section.

Currently, each call to ChanUpdatesInHorizon will seek and deserialize all ChannelEdge values in the requested range. When connected to a large number of peers, this can result in an excessive amount of memory that must be 1) allocated, and 2) cleaned up by the garbage collector. The values are intended to be read only, and are discarded as soon as the relevant information is written out on the wire to the peers.

As a result, the channel cache can greatly reduce the amount of wasted allocations, especially if a large percentage of the requested range is held in memory or peers request similar time slices of the graph.
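The ChannelEdge structure referenced above, reconstructed as a sketch from the channeldb types named in this PR (field names may differ slightly from the actual code):

```go
// ChannelEdge couples a channel's static, authenticated information with
// both of its currently known directed edge policies.
type ChannelEdge struct {
	// Info contains the static information describing the channel.
	Info *ChannelEdgeInfo

	// Policy1 and Policy2 hold the latest advertised routing policy for
	// each direction of the channel, if known.
	Policy1 *ChannelEdgePolicy
	Policy2 *ChannelEdgePolicy
}
```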
Eviction
Both caches employ randomized eviction when inserting an element would cause the cache to exceed its configured capacity. The rationale stems from the fact that the access pattern for these caches is dictated entirely by our peers. Assuming the entire working set cannot fit in memory, a deterministic caching strategy would ease a malicious peer's ability to craft access patterns that incur a worst-case hit frequency (close to or equal to 0%). The resulting effect would be equivalent to having no cache at all, forcing us to hit disk for each rejection check and/or deserialize each requested ChannelEdge. The randomized eviction strategy thus provides some level of DOS protection, while also being simple and efficient to implement in Go (because map iteration order is randomized).
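As an illustration, here is a minimal sketch of randomized eviction built on Go's randomized map iteration (the type and field names are assumptions; the actual implementation may differ):

```go
// channelCache is a bounded cache whose eviction victim is chosen by
// ranging over a map, which Go iterates in randomized order.
type channelCache struct {
	n        int
	channels map[uint64]ChannelEdge
}

// insert adds the given entry to the cache. If the cache is at capacity and
// the key is not already present, a random existing entry is evicted first.
func (c *channelCache) insert(chanID uint64, channel ChannelEdge) {
	// Only evict when the key is new and the cache is already full.
	if _, ok := c.channels[chanID]; !ok && len(c.channels) >= c.n {
		// Delete the first key visited; since map iteration order is
		// randomized, this approximates a uniformly random eviction.
		for id := range c.channels {
			delete(c.channels, id)
			break
		}
	}
	c.channels[chanID] = channel
}
```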
Lazy Consistency
For some cases, keeping the cache in sync with the on-disk state requires reading and deserializing extra data from the db that is not deducible from the inputs. However, at the time the entry is modified, it's not certain that the entry will be accessed again, meaning that extraneous allocations and deserializations may be performed, even though the entry could be evicted before that data is ever used.
For this reason, both the reject and channel caches remove entries whenever an operation dirties an entry, and then lazily load them on the next access. The lone exception is UpdateChannelPolicy, where we write through info to the caches if those entries are already present, because it is the most frequently used operation. If the entries are not present, then they are lazily loaded on the next access for the reasons stated above.

There are other possible places we could add write through, though removing the entry is by far the safest alternative. We can proceed in doing so in other places if they prove to be a bottleneck.
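A hypothetical helper illustrating that write-through-if-present policy (the method, its parameters, and the rejectCache accessors are assumptions layered on the sketches above, not the exact diff):

```go
// refreshRejectCacheOnPolicy refreshes the reject cache entry for chanID
// after a policy update has been persisted, but only if the entry is
// already resident. Absent entries are left to be lazily reloaded by the
// next HasChannelEdge call. isUpdate1 indicates which direction's
// timestamp the new policy corresponds to.
func (c *ChannelGraph) refreshRejectCacheOnPolicy(chanID uint64,
	updateTime int64, isUpdate1 bool) {

	entry, ok := c.rejectCache.get(chanID)
	if !ok {
		return
	}

	if isUpdate1 {
		entry.upd1Time = updateTime
	} else {
		entry.upd2Time = updateTime
	}
	c.rejectCache.insert(chanID, entry)
}
```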
CLI configuration
Each cache can be configured by way of the new lncfg.Caches subconfig, allowing users to set the maximum number of cache entries for both the reject and channel caches. The configuration options appear roughly as in the sketch at the end of this section.

At the default values provided, the reject cache occupies 1.2MB and easily holds an entry for today's entire graph. The channel cache occupies about 40MB, holding about half of the channels in memory. The majority of peers query for values near the tip of the graph, allowing gossip queries to be satisfied almost entirely from the in-memory contents.
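The flag names and defaults below are reconstructed from memory and from the sizes quoted above (roughly 24 bytes per reject cache entry and about 2KB per channel cache entry), so treat them as approximate:

```
--caches.reject-cache-size=   Maximum number of entries contained in the reject cache (default: 50000)
--caches.channel-cache-size=  Maximum number of entries contained in the channel cache (default: 20000)
```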
There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled. Inherently no caching algorithm can be optimal under such circumstances, though empirically, the channel cache quickly converges back to having most of the elements at tip for subsequent queries on the hot spots.
I'm open to other opinions on the default cache sizes; please discuss if people have others they prefer!
Depends on: