.. _config-bgp-floating-ip-over-l2-segmented-network:

==========================================
BGP floating IPs over l2 segmented network
==========================================

The general principle is that L2 connectivity will be bound to a single rack.
Everything outside the switches of the rack will be routed using BGP. To
perform the BGP announcement, neutron-dynamic-routing is leveraged.

To achieve this, on each rack, servers are set up with a different management
network using a VLAN ID per rack (the light green and orange networks below).
Note that a unique VLAN ID per rack isn't mandatory: it is also possible to
use the same VLAN ID on all racks. The point here is only to isolate L2
segments (typically, routing between the switches of each rack will be done
over BGP, without L2 connectivity).

.. image:: figures/bgp-floating-ip-over-l2-segmented-network.png


On the OpenStack side, a provider network must be set up, using a different
subnet range and VLAN ID for each rack. This includes:

* an address scope

* some network segments for that network, which are attached to a named
  physical network

* a subnet pool using that address scope

* one provider network subnet per segment (each subnet+segment pair matches
  one rack's physical network name)

A segment is attached to a specific VLAN and physical network name. In the
above figure, the provider network is represented by two subnets: the dark
green and the red ones. The dark green subnet is on one network segment, and
the red one on another. Both subnets are of the subnet service type
"network:floatingip_agent_gateway", so that they cannot be used by virtual
machines directly.

On top of all of this, a floating IP subnet without a segment is added, which
spans all of the racks. This subnet must have the below service types:

* network:routed

* network:floatingip

* network:router_gateway

Since the network:routed subnet isn't bound to a segment, it can be used on
all racks. The service types network:floatingip and network:router_gateway
restrict this subnet to floating IPs and router gateways, while the
segment-bound subnets act as the gateways for them (that is, the next hop to
reach these floating IPs and router external gateways).


Configuring the Neutron API side
--------------------------------

On the controller side (that is, the API and RPC server), only the Neutron
Dynamic Routing Python library must be installed (for example, in the Debian
case, that would be the neutron-dynamic-routing-common and
python3-neutron-dynamic-routing packages). On top of that, "segments" and
"bgp" must be added to the list of plugins in service_plugins. For example
in neutron.conf:

.. code-block:: ini

   [DEFAULT]
   service_plugins=router,metering,qos,trunk,segments,bgp

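On a Debian-based system, for example, installing the packages mentioned
above could look like this (adapt the package names to your distribution):

.. code-block:: console

   # apt install neutron-dynamic-routing-common python3-neutron-dynamic-routing

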
The BGP agent
-------------

The neutron-bgp-agent must be installed, ideally twice per rack, on any
machine (it does not matter much where). Each of these BGP agents will then
establish a session with one switch and advertise all of the BGP
configuration.

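Each agent reads its settings from /etc/neutron/bgp_dragent.ini. Below is a
minimal sketch of what that file may look like; the bgp_router_id value is a
placeholder for the host's own address, and the driver path may vary with
your release of neutron-dynamic-routing:

.. code-block:: ini

   [BGP]
   bgp_speaker_driver = neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver.OsKenBgpDriver
   bgp_router_id = 10.1.0.32

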
Setting-up BGP peering with the switches
----------------------------------------

A peer that represents the network equipment must be created first, followed
by a matching BGP speaker. The BGP speaker must then be associated with a
dynamic-routing agent (in our example, the dynamic-routing agents run on
compute 1 and 4). Finally, the peer is added to the BGP speaker, so the
speaker initiates a BGP session to the network equipment.

.. code-block:: console

   $ # Create a BGP peer to represent the switch 1,
   $ # which runs FRR on 10.1.0.253 with AS 64601
   $ openstack bgp peer create \
       --peer-ip 10.1.0.253 \
       --remote-as 64601 \
       rack1-switch-1

   $ # Create a BGP speaker on compute-1
   $ BGP_SPEAKER_ID_COMPUTE_1=$(openstack bgp speaker create \
       --local-as 64999 --ip-version 4 mycloud-compute-1.example.com \
       --format value -c id)

   $ # Get the agent ID of the dragent running on compute 1
   $ BGP_AGENT_ID_COMPUTE_1=$(openstack network agent list \
       --host mycloud-compute-1.example.com --agent-type bgp \
       --format value -c ID)

   $ # Add the BGP speaker to the dragent of compute 1
   $ openstack bgp dragent add speaker \
       ${BGP_AGENT_ID_COMPUTE_1} ${BGP_SPEAKER_ID_COMPUTE_1}

   $ # Add the BGP peer to the speaker of compute 1
   $ openstack bgp speaker add peer \
       mycloud-compute-1.example.com rack1-switch-1

   $ # Tell the speaker not to advertise tenant networks
   $ openstack bgp speaker set \
       --no-advertise-tenant-networks mycloud-compute-1.example.com

It is possible to repeat this operation for a second machine on the same rack
if the deployment is using bonding (and therefore LACP to both switches), as
per the figure above, and to do the same on each rack. One way to deploy is
to select two computers in each rack (for example, one compute node and one
network node) and install the neutron-dynamic-routing-agent on each of them,
so they can "talk" to both switches of the rack. All of this depends on the
configuration on the switch side; it may even be that you only need to talk
to two ToR switches in the whole deployment. What you must know is that you
can deploy as many dynamic-routing agents as needed, and that one agent can
talk to a single device.

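As a sketch, the same sequence for a second rack could look like the
following, assuming the dynamic-routing agent runs on
mycloud-compute-4.example.com and that rack's switch is reachable on
10.2.0.253 with AS 64602 (these values are illustrative):

.. code-block:: console

   $ # Create a BGP peer and speaker for rack 2
   $ openstack bgp peer create \
       --peer-ip 10.2.0.253 \
       --remote-as 64602 \
       rack2-switch-1
   $ openstack bgp speaker create \
       --local-as 64999 --ip-version 4 mycloud-compute-4.example.com
   $ BGP_AGENT_ID_COMPUTE_4=$(openstack network agent list \
       --host mycloud-compute-4.example.com --agent-type bgp \
       --format value -c ID)
   $ openstack bgp dragent add speaker \
       ${BGP_AGENT_ID_COMPUTE_4} mycloud-compute-4.example.com
   $ openstack bgp speaker add peer \
       mycloud-compute-4.example.com rack2-switch-1
   $ openstack bgp speaker set \
       --no-advertise-tenant-networks mycloud-compute-4.example.com

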
Setting-up physical network names
---------------------------------

Before setting up the provider network, the physical network name must be set
on each host, according to the rack names. On the compute or network nodes,
this is done in /etc/neutron/plugins/ml2/openvswitch_agent.ini using the
bridge_mappings directive:

.. code-block:: ini

   [ovs]
   bridge_mappings = physnet-rack1:br-ex

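On a host in the second rack, the same directive would reference that rack's
physical network name instead (assuming the external bridge is also named
br-ex there):

.. code-block:: ini

   [ovs]
   bridge_mappings = physnet-rack2:br-ex

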
All of the physical networks created this way must be added to the
configuration of the neutron-server as well (that is, it is used by both
neutron-api and neutron-rpc-server). For example, with 3 racks, here is what
/etc/neutron/plugins/ml2/ml2_conf.ini should look like:

.. code-block:: ini

   [ml2_type_flat]
   flat_networks = physnet-rack1,physnet-rack2,physnet-rack3

   [ml2_type_vlan]
   network_vlan_ranges = physnet-rack1,physnet-rack2,physnet-rack3

Once this is done, the provider network can be created, using physnet-rack1
as "physical network".


Setting-up the provider network
-------------------------------

Everything that is in the provider network's scope will be advertised through
BGP. Here is how to create the address scope:

.. code-block:: console

   $ # Create the address scope
   $ openstack address scope create --share --ip-version 4 provider-addr-scope

Then, the network can be created using the physical network name set above:

.. code-block:: console

   $ # Create the provider network that spans all racks
   $ openstack network create --external --share \
       --provider-physical-network physnet-rack1 \
       --provider-network-type vlan \
       --provider-segment 11 \
       provider-network

This automatically creates a network AND a segment. By default, this segment
has no name, which is not convenient, but the name can be changed:

.. code-block:: console

   $ # Get the network ID:
   $ PROVIDER_NETWORK_ID=$(openstack network show provider-network \
       --format value -c id)

   $ # Get the segment ID:
   $ FIRST_SEGMENT_ID=$(openstack network segment list \
       --format csv -c ID -c Network | \
       q -H -d, "SELECT ID FROM - WHERE Network='${PROVIDER_NETWORK_ID}'")

   $ # Set the 1st segment name, matching the rack name
   $ openstack network segment set --name segment-rack1 ${FIRST_SEGMENT_ID}

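The example above uses the third-party "q" CSV-querying tool. If it is not
available, the segment ID can also likely be retrieved with the --network
filter of the segment listing (at this point the provider network only has
one segment):

.. code-block:: console

   $ FIRST_SEGMENT_ID=$(openstack network segment list \
       --network provider-network --format value -c ID)

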
Setting-up the 2nd segment
--------------------------

The 2nd segment, which will be attached to our provider network, is created
this way:

.. code-block:: console

   $ # Create the 2nd segment, matching the 2nd rack name
   $ openstack network segment create \
       --physical-network physnet-rack2 \
       --network-type vlan \
       --segment 13 \
       --network provider-network \
       segment-rack2

Setting-up the provider subnets for the BGP next HOP routing
------------------------------------------------------------

These subnets will be in use in different racks, depending on what physical
network is in use in the machines. In order to use the address scope, subnet
pools must be used. Here is how to create the subnet pool with the two ranges
to use later when creating the subnets:

.. code-block:: console

   $ # Create the provider subnet pool which includes all ranges for all racks
   $ openstack subnet pool create \
       --pool-prefix 10.1.0.0/24 \
       --pool-prefix 10.2.0.0/24 \
       --address-scope provider-addr-scope \
       --share \
       provider-subnet-pool

Then, this is how to create the two subnets. In this example, we are keeping
.1 for the gateway, .2 for the DHCP server, and .253 and .254 for the
switches, as these last addresses will be used by the switches for the BGP
announcements:

.. code-block:: console

   $ # Create the subnet for physnet-rack1, using segment-rack1, and
   $ # the subnet_service_type network:floatingip_agent_gateway
   $ openstack subnet create \
       --service-type 'network:floatingip_agent_gateway' \
       --subnet-pool provider-subnet-pool \
       --subnet-range 10.1.0.0/24 \
       --allocation-pool start=10.1.0.3,end=10.1.0.252 \
       --gateway 10.1.0.1 \
       --network provider-network \
       --network-segment segment-rack1 \
       provider-subnet-rack1

   $ # The same, for the 2nd rack
   $ openstack subnet create \
       --service-type 'network:floatingip_agent_gateway' \
       --subnet-pool provider-subnet-pool \
       --subnet-range 10.2.0.0/24 \
       --allocation-pool start=10.2.0.3,end=10.2.0.252 \
       --gateway 10.2.0.1 \
       --network provider-network \
       --network-segment segment-rack2 \
       provider-subnet-rack2

Note the service types: network:floatingip_agent_gateway makes sure that
these subnets will only be used as gateways (that is, the next BGP hop). The
above can be repeated for each new rack.


Adding a subnet for VM floating IPs and router gateways
-------------------------------------------------------

This is to be repeated each time a new subnet must be created for floating
IPs and router gateways. First, the range is added to the subnet pool, then
the subnet itself is created:

.. code-block:: console

   $ # Add a new prefix in the subnet pool for the floating IPs:
   $ openstack subnet pool set \
       --pool-prefix 203.0.113.0/24 \
       provider-subnet-pool

   $ # Create the floating IP subnet
   $ openstack subnet create vm-fip \
       --service-type 'network:routed' \
       --service-type 'network:floatingip' \
       --service-type 'network:router_gateway' \
       --subnet-pool provider-subnet-pool \
       --subnet-range 203.0.113.0/24 \
       --network provider-network

The service type network:routed ensures we are using BGP through the provider
network to advertise the IPs, while network:floatingip and
network:router_gateway limit the use of these IPs to floating IPs and router
gateways.

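The service types attached to the new subnet can be double-checked, for
example:

.. code-block:: console

   $ openstack subnet show vm-fip -c service_types

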
Setting-up BGP advertising
--------------------------

The provider network needs to be added to each of the BGP speakers. This
means that each time a new rack is set up, the provider network must be added
to the two BGP speakers of that rack.

.. code-block:: console

   $ # Add the provider network to the BGP speakers.
   $ openstack bgp speaker add network \
       mycloud-compute-1.example.com provider-network
   $ openstack bgp speaker add network \
       mycloud-compute-4.example.com provider-network

In this example, we have selected two compute nodes that are also running an
instance of the neutron-dynamic-routing-agent daemon.


Per project operation
---------------------

This can be done by each customer. A subnet pool is not mandatory, but it is
nice to have. Typically, the customer network will not be advertised through
BGP (though this can be done if needed).

.. code-block:: console

   $ # Create the tenant private network
   $ openstack network create tenant-network

   $ # Self-service network pool:
   $ openstack subnet pool create \
       --pool-prefix 192.168.130.0/23 \
       --share \
       tenant-subnet-pool

   $ # Self-service subnet:
   $ openstack subnet create \
       --network tenant-network \
       --subnet-pool tenant-subnet-pool \
       --prefix-length 24 \
       tenant-subnet-1

   $ # Create the router
   $ openstack router create tenant-router

   $ # Add the tenant subnet to the tenant router
   $ openstack router add subnet \
       tenant-router tenant-subnet-1

   $ # Set the router's default gateway. This will use one public IP.
   $ openstack router set \
       --external-gateway provider-network tenant-router

   $ # Create a first VM on the tenant subnet
   $ openstack server create --image debian-10.5.0-openstack-amd64.qcow2 \
       --flavor cpu2-ram6-disk20 \
       --nic net-id=tenant-network \
       --key-name yubikey-zigo \
       test-server-1

   $ # Optionally, add a floating IP
   $ openstack floating ip create provider-network
   +---------------------+--------------------------------------+
   | Field               | Value                                |
   +---------------------+--------------------------------------+
   | created_at          | 2020-12-15T11:48:36Z                 |
   | description         |                                      |
   | dns_domain          | None                                 |
   | dns_name            | None                                 |
   | fixed_ip_address    | None                                 |
   | floating_ip_address | 203.0.113.17                         |
   | floating_network_id | 859f5302-7b22-4c50-92f8-1f71d6f3f3f4 |
   | id                  | 01de252b-4b78-4198-bc28-1328393bf084 |
   | name                | 203.0.113.17                         |
   | port_details        | None                                 |
   | port_id             | None                                 |
   | project_id          | d71a5d98aef04386b57736a4ea4f3644     |
   | qos_policy_id       | None                                 |
   | revision_number     | 0                                    |
   | router_id           | None                                 |
   | status              | DOWN                                 |
   | subnet_id           | None                                 |
   | tags                | []                                   |
   | updated_at          | 2020-12-15T11:48:36Z                 |
   +---------------------+--------------------------------------+
   $ openstack server add floating ip test-server-1 203.0.113.17


Cumulus switch configuration
----------------------------

Because of the way Neutron works, for each new port associated with an IP
address, a GARP is issued to inform the switch about the new MAC / IP
association. Unfortunately, this confuses the switches, which may decide to
use their local ARP table to route the packets rather than handing them to
the next hop. The definitive solution would be to patch Neutron to make it
stop sending GARPs for any port on a subnet with the network:routed service
type. Such a patch would be hard to write but, luckily, there is a workaround
that works (at least with Cumulus switches). Here is how.

In /etc/network/switchd.conf, change this:

.. code-block:: ini

   # configure a route instead of a neighbor with the same ip/mask
   #route.route_preferred_over_neigh = FALSE
   route.route_preferred_over_neigh = TRUE

and then simply restart switchd:

.. code-block:: console

   systemctl restart switchd

This restarts the switch ASIC, so it may be a dangerous thing to do without
switch redundancy (be careful when doing it). The completely safe procedure,
if there are two switches per rack, looks like this:

.. code-block:: console

   # save clagd priority
   OLDPRIO=$(clagctl status | sed -r -n 's/.*Our.*Role: ([0-9]+) 0.*/\1/p')
   # make sure that this switch is not the primary clag switch, otherwise the
   # secondary switch will also shut down all interfaces when losing contact
   # with the primary switch (lower priority values win the primary role)
   clagctl priority 65535

   # tell neighbors not to route through this router (use the switch's own
   # AS number, 64601 for the rack 1 switches in this example)
   vtysh
   vtysh# router bgp 64601
   vtysh# bgp graceful-shutdown
   vtysh# exit
   systemctl restart switchd
   clagctl priority $OLDPRIO


Verification
------------

If everything goes well, the floating IPs are advertised over BGP through the
provider network. Here is an example with 4 VMs deployed on 2 racks; Neutron
here picks up IPs on the segmented network as the next hops:

.. code-block:: console

   $ # Check the advertised routes:
   $ openstack bgp speaker list advertised routes \
       mycloud-compute-4.example.com
   +-----------------+-----------+
   | Destination     | Nexthop   |
   +-----------------+-----------+
   | 203.0.113.17/32 | 10.1.0.48 |
   | 203.0.113.20/32 | 10.1.0.65 |
   | 203.0.113.40/32 | 10.2.0.23 |
   | 203.0.113.55/32 | 10.2.0.35 |
   +-----------------+-----------+

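On the switch side, if it runs FRR (as in this example), the routes received
from the speakers can be checked with the routing daemon's own tooling:

.. code-block:: console

   vtysh -c 'show ip bgp'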