channeldb: add reject and channel caches #2847

cfromknecht · 2019-03-28T03:30:48Z

This PR adds two caches housed within the channeldb to optimize two existing hot spots related to gossip traffic.

Reject Cache

The first is dubbed a reject cache, whose entries contain a small amount of information critical to determining if we should spend resources validating a particular channel announcement or channel update. This improves the performance of HasChannelEdge, which is a subroutine of KnownEdge and IsStaleEdgePolicy.

Each entry in the reject cache stores the following:

type rejectCacheEntry struct {
    upd1Time int64
    upd2Time int64
    flags    rejectFlags
}

where flags is packed bitfield containing, for now, the exists and isZombie booleans. We store the time as an unix integer as opposed to the time.Time directly, since time.Time's internal pointer for storing timezone info would force the garbage collector to traverse these long-lived entries. They are subsequently reconstructed during calls to HasChannelEdge using the integral value.

The addition of the reject cache greatly improves LND's ability to efficiently filter gossip traffic, and results in a significantly lower number of database accesses for this purpose. Users should notice the gossip syncers terminating much quicker (due to the absence of db hits) and overall result in snappier connections.

Channel Cache

The second is channel cache which caches ChannelEdge values, and used to reduce memory allocations stemming from ChanUpdatesInHorizon. A ChannelEdge has the following structure:

type ChannelEdge struct {
    Info    *ChannelEdgeInfo
    Policy1 *ChannelEdgePolicy
    Policy2 *ChannelEdgePolicy
}

Currently, each call to ChanUpdatesInHorizon will seek and deserialize all ChannelEdge values in the requested range. When connected to large number of peers, this can result an excessive amount of memory that must be 1) allocated, and 2) cleaned up by the garbage collector. The values are intended to be read only, and are discarded as soon as the relevant information is written out on the wire to the peers.

As a result, the channel cache can greatly reduce the amount of wasted allocations, especially if a large percentage of the requested range is held in memory or peers request similar time slices of the graph.

Eviction

Both caches employ randomized eviction when inserting an element would cause the cache to exceed its configured capacity. The rationale stems from the fact that the access pattern for these caches is dictated entirely by our peers. Assuming the entire working set cannot fit in memory, a deterministic caching strategy would ease a malicious peer's ability to craft access patterns that incur a worst-case hit frequency (close-to-or-equal-to 0%). The resulting effect would that we equivalent to having no cache at all, and force us to hit disk for each rejection check and/or deserialize each requested ChannelEdge. The randomized eviction strategy thus provides some level of DOS protection, while also being simple to and efficient to implement in go (because map iteration is randomized by default).

Lazy Consistency

For some cases, keeping the cache in sync with the on-disk state requires reading and deserializing extra data from the db that is not deducible from the inputs. However, at the time the entry is modified, it's not certain that the entry will be accessed again, meaning that extraneous allocations and deserializations may be performed, even though the entry could be evicted before that data is ever used.

For this reason, both the reject and channel caches remove entries from cache whenever an operation dirties an entry and then lazily loads them on the next access. The lone exception is UpdateChannelPolicy, where we write through info to the caches if those entries are already present because it is the most frequently used operation. If the entries are not present, then they are lazily loaded on the next access for the reasons stated above.

There are other possible places we could add write through, though removing the entry is by far the safest alternative. We can proceed in doing so in other places if they prove to be a bottleneck.

CLI configuration

Each cache can be configured by way of the new lncfg.Caches subconfig, allowing users to set the maximum number of cache entries for both the reject and channel caches. The configuration options appear as:

caches:
      --caches.reject-cache-size=    Maximum number of entries contained in the reject cache, which is used to speed up filtering of new channel announcements and channel updates from peers. Each entry requires 25 bytes. (default: 50000)
      --caches.channel-cache-size=   Maximum number of entries contained in the channel cache, which is used to reduce memory allocations from gossip queries from peers. Each entry requires 2Kb. (default: 20000)

At the default values provided, the reject cache occupies 1.2MB and easily holds an entry for today's entire graph. The channel cache occupies about 40MB, holding about half of the channels in memory. The majority of peers query for values near the tip of the graph, allowing gossip queries to be satisfied almost entirely from the in-memory contents.

There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled. Inherently no caching algorithm can be optimal under such circumstances, though empirically, the channel cache quickly converges back to having most of the elements at tip for subsequent queries on the hot spots.

I'm open to other opinions on the default cache sizes, please discuss if people have others they prefer!

Depends on:

lntest/timeout: split into darwin and non-darwin timeouts #2846

Roasbeef · 2019-03-28T23:46:27Z

Now that the dependent PR has been merged, this can be rebased!

Roasbeef

Yay caching! I've been running this on a few of my mainnet nodes and have noticed that combined with the gossip sync PR, it pretty much does away with the large memory burst a node can see today on restart.

No major comments, other than pointing out a future direction that we may want to consider where we update the cache with a new entry rather than removing the entry from the cache. Eventually, we can also extend this channel cache to be used in things like path finding or computing DescribeGraph, etc.

channeldb/graph.go

channeldb/db.go

Roasbeef · 2019-03-28T23:51:20Z

channeldb/reject_cache.go

+	// rejectFlagExists is a flag indicating whether the channel exists,
+	// i.e. the channel is open and has a recent channel update. If this
+	// flag is not set, the channel is either a zombie or unknown.
+	rejectFlagExists = 1 << 0


Seems like a safe opportunity to use iota, and also declare these to be typed as rejectFlags rather than being an uint8.

channeldb/reject_cache.go

channeldb/channel_cache.go

Roasbeef · 2019-03-29T00:10:17Z

channeldb/graph.go

+		return err
+	}
+
+	c.rejectCache.remove(edge.ChannelID)


Alternatively, we can insert the new copy of the edge policy into the cache. I see no issue in delaying this to a distinct change though once we see how this fares in the wild once most of the network has updated.

channeldb/db.go

config.go

wpaulino

There are certain peers that request a full dump of all known channels, which will require going to disk (unless of course the channel cache is configured to hold the entire graph in memory). AFAIK, CL currently queries for the entire range on each connection, and after #2740 LND nodes will begin doing so roughly once every six hours to a random peer to ensure that any holes in its routing table are filled.

This won't be too bad since we'll only request all the channel_ids we don't know of that the remote peer does. After this is done once, the cost of a historical sync should be pretty negligible as long as we don't go offline for a long period of time.

channeldb/channel_cache.go

channeldb/graph.go

Roasbeef

LGTM 🍭

Have been testing this out on my node over the past few days, and it's significantly helped with the initial allocation burst when connecting to peers for the first time. I anticipate we'll also get a lot of feedback from node operators during the RC cycle as well, which can be used to modify the cache write policies or default sizes. Needs a rebase to remove the Drawin travis timing commits!

channeldb/channel_cache.go

channeldb/graph.go

halseth

🔥

channeldb/graph.go

lncfg/caches.go

config.go

channeldb/graph.go

halseth · 2019-04-01T10:49:04Z

channeldb/graph.go

 	})
 	if err != nil {
 		return err
 	}

-	c.rejectCache.remove(edge.ChannelID)
-	c.chanCache.remove(edge.ChannelID)
+	if entry, ok := c.rejectCache.get(edge.ChannelID); ok {


Better to just remove IMO, we have no guarantees that the updated policy doesn't change the value of the rejectFlags.

the reject flags are only changed wrt to channel existence/zombie pruning. updating the edge policy shouldn't modify rejectFlags

channeldb/graph.go

cfromknecht · 2019-04-01T22:48:11Z

@Roasbeef @wpaulino @halseth latest version is up and comments addressed, ptal

This reverts commit 3555d3d.

…use Batch" This reverts commit e8da6dd.

This reverts commit da76c34.

This commit introduces the Validator interface, which is intended to be implemented by any sub configs. It specifies a Validate() error method that should fail if a sub configuration contains any invalid or insane parameters. In addition, a package-level Validate method can be used to check a variadic number of sub configs implementing the Validator interface. This allows the primary config struct to be extended via targeted and/or specialized sub configs, and validate all of them in sequence without bloating the main package with the actual validation logic.

wpaulino

LGTM 🛰

cfromknecht force-pushed the reject-and-channel-cache branch from 3a4ed58 to c131a21 Compare March 28, 2019 03:55

Roasbeef added this to the 0.6 milestone Mar 28, 2019

cfromknecht force-pushed the reject-and-channel-cache branch from c131a21 to 4aab755 Compare March 28, 2019 11:29

Roasbeef requested changes Mar 29, 2019

View reviewed changes

wpaulino reviewed Mar 29, 2019

View reviewed changes

channeldb/channel_cache.go Show resolved Hide resolved

channeldb/graph.go Outdated Show resolved Hide resolved

cfromknecht force-pushed the reject-and-channel-cache branch from 4aab755 to a9da15c Compare March 29, 2019 02:33

Roasbeef approved these changes Mar 29, 2019

View reviewed changes

channeldb/channel_cache.go Show resolved Hide resolved

channeldb/graph.go Show resolved Hide resolved

Roasbeef reviewed Mar 29, 2019

View reviewed changes

channeldb/graph.go Show resolved Hide resolved

halseth suggested changes Apr 1, 2019

View reviewed changes

halseth mentioned this pull request Apr 1, 2019

lnd is writing to disk almost all the time #2860

Closed

cfromknecht force-pushed the reject-and-channel-cache branch from a9da15c to 0a0cd51 Compare April 1, 2019 22:10

cfromknecht added 16 commits April 1, 2019 16:25

Revert "channeldb: convert all Update calls to use Batch"

f6d93c8

This reverts commit 3555d3d.

Revert "channeldb: convert concurrent channel state machine calls to …

e35f676

…use Batch" This reverts commit e8da6dd.

Revert "channeldb: convert invoice settle/cancel calls to use Batch"

2f92749

This reverts commit da76c34.

channeldb/db: remove unused buffer pool

ac315fd

channeldb/graph: add newChannelGraph constructor

ecbb786

channeldb/db: init one ChannelGraph per channeldb.DB

3c46cee

channeldb/reject_cache: add rejectCache w/ randomized eviction

af0ea35

channeldb/channel_cache: add channelCache w/ randomized eviction

b20a254

channeldb/graph+db: integrate reject and channel caches

ae3a00a

channeldb/options: add Options w/ functional modifiers

baa968b

channeldb: accept cache sizes in ChannelGraph

6d3e081

lncfg/caches: adds Caches sub config

7504d46

config: expose Caches subconfig

4336659

lnd: pass CLI reject+channel cache sizes to channeldb Open

63b15fd

channeldb: write chan updates through reject+channel cache

5d98a94

cfromknecht force-pushed the reject-and-channel-cache branch from 0a0cd51 to 5d98a94 Compare April 1, 2019 23:37

cfromknecht mentioned this pull request Apr 2, 2019

channeldb: cache HasChannelEdge queries #2697

Closed

cfromknecht added database Related to the database/storage of LND optimization gossip labels Apr 2, 2019

wpaulino approved these changes Apr 2, 2019

View reviewed changes

Roasbeef merged commit 1dc1e85 into lightningnetwork:master Apr 3, 2019

cfromknecht deleted the reject-and-channel-cache branch April 3, 2019 02:14

cfromknecht mentioned this pull request Jun 6, 2019

routing/router: remove router-level reject cache #3170

Merged

channeldb: add reject and channel caches #2847

channeldb: add reject and channel caches #2847

Uh oh!

Conversation

cfromknecht commented Mar 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reject Cache

Channel Cache

Eviction

Lazy Consistency

CLI configuration

Uh oh!

Roasbeef commented Mar 28, 2019

Uh oh!

Roasbeef left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Roasbeef Mar 28, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Roasbeef Mar 29, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

wpaulino left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Roasbeef left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

halseth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

halseth Apr 1, 2019

Choose a reason for hiding this comment

Uh oh!

cfromknecht Apr 1, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cfromknecht commented Apr 1, 2019

Uh oh!

wpaulino left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cfromknecht commented Mar 28, 2019 •

edited

Loading