NOTE I wrote this coming off a nasty fever and eventually I also ran out of steam to make this better. I'm not sure how well I pieced this together, but it feels like putting something down is better than nothing.
EDIT 09.13.22: Added section at end about inner ethernet header.
Today, @rmustacc, @rcgoodfellow, @bnaecker, @jmpesp, and I had a call about the current and future plan for forwarding packets for the purpose of VPC delivery. Specifically, what program state and actions are required at Boundary Services and OPTE for the appropriate delivery of a VPC packet. We agreed that something should be written down somewhere to record our decisions from this call so that our future selves have something to refer back to when we inevitably ask "why are we doing it this way?". I'm writing up this issue for lack of a better idea of where this should go (honestly, it feels like an addendum RFD that points back to RFD 63 is the proper place to eventually put this). Also, if we change how we implement packet forwarding in the future it would require OPTE changes, so it doesn't hurt to start by documenting the discussion in an issue.
What is "packet forwarding"?
Packet forwarding is the method by which we uniquely identify a packet's destination: how we address and deliver packets in the Oxide VPC network. There are two places in the Oxide VPC network where forwarding occurs.
Boundary Services. At the edge of our rack, where we interface with the external (customer) network. Any traffic that is destined for a customer's IP or an internet IP must be routed to Boundary Services.
OPTE. At the opposite edge, directly attached to the guest instance. OPTE is essentially a virtual gateway to the guest. It's the last stop for a packet destined to the guest, and the first stop for a packet leaving the guest.
Boundary Services Forwarding
Boundary Services needs to forward packets coming inbound from the external network, destined for a guest's external IP (which I'll shorten to EIP), to the correct physical sled where that guest lives. Boundary Services keeps NAT state that determines which sled a packet with a given <destination EIP> + <destination L4 port> should be delivered to. An EIP can be one of three things:
An ephemeral IP: an external IP that is temporarily loaned to a guest for the purpose of both inbound and outbound connectivity to the customer's network and the internet. Its mapping is valid for the lifetime of the guest.
A floating IP: like an ephemeral IP, but its lifetime extends beyond a guest's. Its mapping changes when its guest attachment changes.
An SNAT IP: an external IP that is temporarily loaned to a guest for the purpose of outbound connectivity only. This type of EIP maps to one or more guests by divvying up the L4 port space, assigning ownership of a chunk of port space to a running guest. The guests that share an SNAT IP do not necessarily live on the same physical sled. This necessitates an entry for each SNAT IP + port range.
For example, let's assume the customer has deployed our rack in their internal IPv4 network of 10.77.0.0/16:
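EIP          Port range   Sled     Inner dest MAC
10.77.1.2    1-65535      sled9    A8:40:25:FF:00:01
10.77.1.3    1-4095       sled11   A8:40:25:FF:00:02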
This table says that all packets destined for EIP 10.77.1.2 are forwarded to sled9 and should have their inner dest MAC address rewritten to A8:40:25:FF:00:01. In this case the entire port space is mapped for this IP, which means that guest instance has a dedicated external IP -- either ephemeral or floating.
Any packets destined for 10.77.1.3:1-4095 are forwarded to sled11 and should have their inner dest MAC address rewritten to A8:40:25:FF:00:02. In this case only a portion of the port space is mapped for the external IP -- this is an SNAT EIP.
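To make that concrete, here's a rough Rust sketch of the kind of state this implies -- the type and field names are mine, not the actual Boundary Services/Dendrite code:

```rust
use std::collections::BTreeMap;
use std::net::{IpAddr, Ipv6Addr};
use std::ops::RangeInclusive;

/// Where a matching packet goes and what its inner dest MAC becomes.
struct NatTarget {
    sled_underlay_addr: Ipv6Addr, // physical (underlay) address of the sled
    vni: u32,                     // Geneve VNI of the guest's VPC
    guest_mac: [u8; 6],           // inner dest MAC after rewrite
}

/// NAT state keyed by EIP; each EIP carries one or more L4 port ranges.
/// A dedicated (ephemeral/floating) EIP has a single 1-65535 range, while
/// an SNAT EIP has one range per guest sharing it.
struct NatTable {
    entries: BTreeMap<IpAddr, Vec<(RangeInclusive<u16>, NatTarget)>>,
}

impl NatTable {
    fn lookup(&self, dst_ip: IpAddr, dst_port: u16) -> Option<&NatTarget> {
        self.entries
            .get(&dst_ip)?
            .iter()
            .find(|(ports, _)| ports.contains(&dst_port))
            .map(|(_, target)| target)
    }
}
```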
When forwarding packets, Boundary Services has to encapsulate and potentially rewrite part of the packet in order for OPTE to have enough information to perform its own discrimination. Exactly what that information needs to be is the crux of this issue, as it determines the state Boundary Services needs to keep and the data it needs to encode in the packet. But first, I'll go over discrimination from the perspective of OPTE.
OPTE Forwarding
Each guest NIC has a corresponding OPTE "Port". The "Port" abstraction is meant to match that of a switch port. On each sled there may be many guest instances. Each guest instance may have one or more virtual NICs (that is, for our HVM we chose to emulate at the L2 layer by way of virtio devices, but we could also have chosen to do L3 instead). Inside of OPTE there is a simple virtual switch that maps packets to OPTE Ports -- just like any physical switch would (with the exception that this is not a learning switch, but one that is statically programmed by a control plane layer). This virtual switch needs a way to discriminate a packet to its OPTE Port. Using just the guest MAC address is not enough. The way we have defined the Oxide VPC Network thus far is that it provides L3 connectivity but still has some notion of L2 beneath it in case we decide to provide L2 emulation in the future. Given our limited L2 space (some space is reserved for Oxide OUI + physical vs virtual space), we decided that each VPC is its own L2 domain. This is not much different than VLANs, but instead we map a given VPC to a Geneve VNI, giving us more space to play with and much more flexibility in how we implement our VPC networks.
Essentially: OPTE needs to pick the correct Port to forward to, and currently it does that by the unique combination of VNI + guest MAC address.
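In sketch form (types and names here are illustrative, not OPTE's actual internals), the switching decision looks roughly like this:

```rust
use std::collections::HashMap;

type PortId = usize; // stand-in for a handle to an OPTE Port

/// The unique combination OPTE currently switches on.
#[derive(PartialEq, Eq, Hash)]
struct SwitchKey {
    vni: u32,           // Geneve VNI identifying the VPC (its L2 domain)
    guest_mac: [u8; 6], // inner destination MAC of the guest VNIC
}

/// A non-learning virtual switch: entries are programmed by the control
/// plane when a guest VNIC (and therefore an OPTE Port) is created.
struct VirtualSwitch {
    ports: HashMap<SwitchKey, PortId>,
}

impl VirtualSwitch {
    fn forward(&self, vni: u32, inner_dst_mac: [u8; 6]) -> Option<PortId> {
        self.ports
            .get(&SwitchKey { vni, guest_mac: inner_dst_mac })
            .copied()
    }
}
```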
The nice thing about this method of discrimination is that it works for both guest-to-guest traffic and external-to-guest traffic. In the former, Boundary Services isn't involved at all; the packets are shuttled directly from one OPTE Port to another (whether that be on the same physical sled, or from one sled to another). But this decision has ramifications for the state Boundary Services has to keep, and that's important because it runs on our switch hardware, whose resources are precious to us.
Boundary Services and OPTE
In order for OPTE to forward correctly it must have the requisite information available in the packet. In the case of EIP traffic it's up to Boundary Services to make sure that data is there. Currently, that means Boundary Services needs to make sure to populate the packet with the correct guest instance VNI (the VPC identifier) and correct guest MAC address. It does this by tracking the state mentioned above and then performing the following transformations on each packet:
Encapsulating the packet with the correct physical sled destination address. This makes sure the rack's physical network can deliver the encapsulated packet to the correct sled.
Encapsulating the packet with the correct Geneve VNI.
Rewriting the inner destination MAC address to be that of the guest.
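Putting those three transformations in code form (again, just a sketch with made-up types, not the real implementation):

```rust
use std::net::Ipv6Addr;

/// Only the header fields the rewrite touches (illustrative layout, not the
/// real packet representation).
struct VpcPacket {
    outer_dst: Ipv6Addr,    // underlay destination (which physical sled)
    vni: u32,               // Geneve VNI (which VPC)
    inner_dst_mac: [u8; 6], // destination MAC the guest will see
    // ... inner IP headers, L4 headers, payload, etc.
}

/// The three transformations listed above, applied on a NAT table hit.
fn forward_inbound(pkt: &mut VpcPacket, sled: Ipv6Addr, vni: u32, guest_mac: [u8; 6]) {
    pkt.outer_dst = sled;          // 1. encapsulate toward the correct sled
    pkt.vni = vni;                 // 2. tag with the guest's VPC VNI
    pkt.inner_dst_mac = guest_mac; // 3. rewrite the inner dest MAC
}
```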
But I thought Oxide VPC is L3 only?
RFDs 21 and 63 make it very clear that today's VPC is predicated exclusively on unicast L3 connectivity. There is no L2 domain available to the guest. All guests are provided a /32 (or /128) IP and forced to speak to an off-subnet gateway, aka the "virtual gateway": OPTE. However, since our HVM implementation provides virtual NICs, we need to have unique MAC addresses. Technically, these MAC addresses only need to be unique at the guest level, not the VPC level. So we could play games here where every guest has the same set of unique MAC addresses, but for now we chose to have each guest VNIC presented with a unique MAC address. So, at the end of the day, we need to provide some notion of L2 to the guest in order for our HVM to work.
The question is: how far does that L2 information need to be propagated?
We could choose a different way to forward packets, one that uses the L3 information instead. The key for this to work is that Boundary Services and OPTE need to agree on how packets are being forwarded. Let's explore this.
Rather than worrying about the inner MAC address, Boundary Services would only make sure to encapsulate the packet with the correct Geneve VNI. This means it would no longer need to keep state that knows how to map an EIP to a guest MAC address. However, it still needs to write something for the inner dest MAC address, as a normal router would do. In this case we come back to the fact that OPTE is a virtual gateway. That virtual gateway has a special MAC that is used for all OPTE Ports in the system, the "lucky sevens" MAC address: A8:40:25:FF:77:77. The fact that this MAC address is not unique means that OPTE can no longer use it for making forwarding decisions; its logic would have to change. What does that look like?
OPTE L3-based forwarding
In this world a packet arrives at the OPTE virtual switch with the guest's VNI and the guest's EIP (assuming we are talking about traffic from the external network). These two pieces of information are enough to uniquely identify and forward the packet to the correct OPTE Port. However, OPTE's switching mechanism needs to change for this to work. The incoming packet would have an inner dest MAC address that is always equal to A8:40:25:FF:77:77, so how do we change OPTE to work with that?
I think the best solution is to keep the existing OPTE switching mechanism in place: namely VNI + inner MAC address. This means that OPTE needs to rewrite the inner MAC before switching. In order to do that it needs to have a global mapping of EIP -> guest MAC address. This could be structured similarly to how the virtual-to-physical resource is dealt with in OPTE: where there is a single mapping shared between all the ports. Anytime the control plane adds a new EIP to a Port the global mapping would also need updating.
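Roughly, that shared mapping and rewrite step might look like the following sketch (names are illustrative; the real analogue would live alongside the virtual-to-physical mappings):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// The lucky-sevens virtual gateway MAC shared by every OPTE Port.
const GATEWAY_MAC: [u8; 6] = [0xA8, 0x40, 0x25, 0xFF, 0x77, 0x77];

/// A single mapping shared by all Ports on the sled; the control plane adds
/// an entry whenever it attaches an EIP to a Port.
struct EipToMac {
    map: HashMap<IpAddr, [u8; 6]>,
}

impl EipToMac {
    /// On ingress, replace the shared gateway MAC with the guest's unique
    /// MAC so the existing VNI + MAC switching logic can run unchanged.
    fn rewrite(&self, inner_dst_ip: IpAddr, inner_dst_mac: &mut [u8; 6]) -> bool {
        debug_assert_eq!(*inner_dst_mac, GATEWAY_MAC);
        match self.map.get(&inner_dst_ip) {
            Some(mac) => {
                *inner_dst_mac = *mac;
                true
            }
            None => false, // no EIP mapping; drop or hand to a slow path
        }
    }
}
```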
What about mac flows?
In today's implementation OPTE uses promisc mode to get all inbound traffic and then performs its own filtering. This means that in addition to legit VPC traffic, it also sees traffic for the underlying physical network, which it has no business seeing (and which creates needless filtering work). The goal is to eventually use the mac flows subsystem, as described in #61. The first step is to just use the mac flows system to deliver any Geneve encapsulated traffic to xde/OPTE, and let the virtual switch perform software classification to determine which Port to forward it to. But eventually we will want to push more filtering to mac flows (which can subsequently push it down to hardware if we start making use of dedicated rings for individual OPTE Ports). The question is how easily mac flows could be adapted to dealing with a many-to-one relationship, where there are many flows that map to the same place. That is, we would need to configure mac flows with knowledge of both VIPs (the private virtual IPs for guests on that sled) as well as EIPs (the external IPs that map to guests on that sled), and for some EIPs that will mean mapping port ranges in the case of SNAT. This is different from the current arrangement, where we can simply use the VNI + inner MAC address to map to a unique Port. That is, if Boundary Services stops rewriting the inner MAC address, then OPTE + mac flows need to gain the ability to forward based on inner L3 data.
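To illustrate the many-to-one shape of the problem (this is purely conceptual, not the actual mac flows API or its match language):

```rust
use std::net::IpAddr;
use std::ops::RangeInclusive;

/// The kinds of inner-L3 matches a flow would need to express.
enum InnerL3Match {
    /// A guest's private virtual IP on this sled.
    Vip(IpAddr),
    /// A dedicated external IP (ephemeral or floating): whole port space.
    Eip(IpAddr),
    /// An SNAT external IP: only a slice of the L4 port space.
    SnatEip(IpAddr, RangeInclusive<u16>),
}

/// Many inner-L3 matches, all funneling into one OPTE Port.
struct FlowSpec {
    vni: u32,                   // the VPC's Geneve VNI
    matches: Vec<InnerL3Match>, // many flows...
    port: usize,                // ...map to the same place
}
```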
Eschew the inner ethernet header?
If we moved to a world where we forward based solely on L3, then we could also consider removing the inner ethernet header. This came up while talking to @bnaecker, who rightfully brought up the question: "why does the oxide-vpc encap code rewrite the inner MAC dest?"
The Oxide VPC implementation rewrites the inner MAC dest so that the receiving OPTE knows how to forward it to the correct port. But traffic could just as easily be uniquely forwarded by inner IP dest. For packets destined to the Internet Gateway we rewrite the inner MAC dest to be that of Boundary Services, which currently is hard-coded to all zeros. We could give this a sentinel value, but it doesn't really serve any purpose. It has no bearing on actual delivery from OPTE to the Boundary Services process running on the switch.
Essentially, for the Oxide VPC implementation, all L2 emulation stops at OPTE. We could consider getting rid of the inner ethernet header on OPTE egress. Even if we were to provide more L2 emulation in the VPC, like real subnets and ARP, we could still probably emulate all of that without ever exposing the inner ethernet header past OPTE. As long as a given OPTE instance has enough information to reconstitute the guest frame during ingress, there is no reason to preserve the frame across the underlay network. Or, at least, I can't think of a reason at this very moment.
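As a sketch of what ingress reconstitution could look like, assuming the receiving Port already knows its guest's MAC and the virtual gateway MAC (names are mine, not OPTE's):

```rust
/// What the receiving OPTE Port already knows about its guest; enough to
/// rebuild the frame without carrying any L2 across the underlay.
struct PortL2Config {
    guest_mac: [u8; 6],   // the guest VNIC's MAC (frame destination)
    gateway_mac: [u8; 6], // the virtual gateway MAC, A8:40:25:FF:77:77 (source)
}

/// Prepend a synthesized ethernet header to the decapsulated inner packet
/// on ingress; ethertype is 0x0800 or 0x86DD depending on the inner IP.
fn reconstitute_frame(cfg: &PortL2Config, ethertype: u16, inner: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(14 + inner.len());
    frame.extend_from_slice(&cfg.guest_mac);
    frame.extend_from_slice(&cfg.gateway_mac);
    frame.extend_from_slice(&ethertype.to_be_bytes());
    frame.extend_from_slice(inner);
    frame
}
```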
There's also a middle ground to be explored here with Geneve header options. For example, within a VPC, if we had a unique N-bit identifier for each instance, then boundary services could place that identifier in the Geneve header and OPTE could discriminate against that. This could also work for instance-to-instance flows.
This is appealing in the sense that it elides the need for more complex discriminators like an SNAT IP + L4 port range. It's also nice to have all the discriminating information within a single header. But it's dissatisfying in the sense that it's creating a synthetic identifier when intrinsic ones already exist. Another knock is that OPTE still needs to map the discriminator to a MAC and rewrite it on the way to the guest. This reminds me a bit of the decisions that need to be made when choosing keys in a relational database haha.
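For illustration, such an identifier could ride in a Geneve option TLV. The option class/type values and the 32-bit width below are made up for the sketch; nothing here is an assigned Geneve option class:

```rust
/// Hypothetical option class/type values, purely for illustration.
const OXIDE_OPT_CLASS: u16 = 0x0000;
const INSTANCE_ID_TYPE: u8 = 0x01;

/// Build a Geneve option TLV (4-byte header: class, type, length in 4-byte
/// words) carrying a 32-bit per-instance identifier as its option data.
fn instance_id_option(instance_id: u32) -> [u8; 8] {
    let mut opt = [0u8; 8];
    opt[0..2].copy_from_slice(&OXIDE_OPT_CLASS.to_be_bytes());
    opt[2] = INSTANCE_ID_TYPE;
    opt[3] = 1; // one 4-byte word of option data follows the header
    opt[4..8].copy_from_slice(&instance_id.to_be_bytes());
    opt
}
```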
FWIW, the existing sidecar code was written with the assumption that we were only doing L3 forwarding, and that L2 did indeed end at OPTE. When encapsulating incoming NAT traffic, it doesn't even include an inner ethernet header - the geneve header directly wraps the IPv4/IPv6 header.
Switching to the L2 model doesn't seem like it will be that difficult.