Configure region controller on AWS

alfred-stokespace · 8 December 2023 23:47

Copying over from Question #708666 “Force rackd to communicate with different regi…” : Questions : Ubuntu (launchpad.net)

I have a regiond MAAS install in AWS. That regiond server likes to think of itself as having a private ip. Which is true, BUT it also has a public IP address that is accessible from the internet (confirmed with curl on port 4240).

I have a rackd computer that I’m trying to configure to talk to the regiond server (it sits in a different network than the private IP of the region server). This starts out well: I give the https URL to the command
sudo maas init rack --maas-url https://....redacted/MAAS --secret ...also-redacted...

When I check the logs of rackd, I find that it fails to perform RPC against the IP returned by the maas-url metadata listing. The IP it returns is the private IP address of the regiond server, and that’s not routable from the rack’s network.
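For anyone wanting to see what the region advertises before touching the database, here is a minimal sketch that parses the RPC endpoint listing and flags which advertised addresses are RFC 1918 private ones. It assumes the listing is JSON shaped like {"eventloops": {name: [[address, port], ...]}}; that shape, the sample payload, and the function names are assumptions for illustration, not confirmed MAAS API.

```python
import ipaddress
import json

# RFC 1918 private ranges; anything outside them is treated as "not private" here.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address falls inside an RFC 1918 private range."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 addresses such as ::ffff:10.0.1.15.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return any(ip in net for net in RFC1918)


def classify_endpoints(payload: str) -> dict:
    """Map each event loop name to a list of (address, port, is_private)."""
    data = json.loads(payload)
    result = {}
    for name, endpoints in data.get("eventloops", {}).items():
        result[name] = [
            (addr, port, is_rfc1918(addr)) for addr, port in endpoints
        ]
    return result


# Hypothetical payload: one event loop advertising a private and a public address.
sample = json.dumps(
    {"eventloops": {"region-1:pid=1234": [["10.0.1.15", 5250], ["203.0.113.7", 5250]]}}
)

for loop_name, entries in classify_endpoints(sample).items():
    for addr, port, private in entries:
        print(loop_name, addr, port, "private" if private else "not private")
```

If every entry comes back as private, a rack in another network has nothing it can route to, which matches the RPC failures described above.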

Soooo… I need to force the public ip of the regiond server on the rack server.

The docs I’m seeing don’t really seem to address this issue for 3.4, so I’m going to go database crawling through your schema and see if I can hack my way through it, I guess.

Please provide some indication of a less gross way to do this.

alfred-stokespace · 8 December 2023 23:57

I can see the RPC endpoints in the table maasserver_regioncontrollerprocessendpoint
and that’s probably the source from which the rack server is getting the private ips

when I naively tried editing each of the four ports’ ip to be the public ip I got blocked by postgresql telling me that my ip address was not the correct type (inet). Not sure how to coerce an ip string into an inet type to make postgresql happy.

Would that have worked? Thoughts?

alfred-stokespace · 9 December 2023 00:07

I also tried this trick…

I can see that regiond wants to register its eth cards, so I created a dummy eth:
cd /etc/systemd/network && touch vip.netdev vip.network
and populated those two files appropriately to represent the public IP that AWS has associated with that EC2 instance.

That had no effect on the advertised rpc endpoints. It did show up as an additional network under the region server though.

alfred-stokespace · 9 December 2023 00:42

Interesting…
So I figured out how to cast ip addresses in postgresql…

update maasserver_regioncontrollerprocessendpoint
set address = 'yyy.yy.yyy.y'::INET
where address = 'zz.zzz.z.zz'::INET;

However that resulted in a doubling of records.
The original records with the private IP are still in that table… I’m guessing regiond repopulates them when it detects they are missing.

alfred-stokespace · 9 December 2023 00:49

Okay, so it looks like regiond removes entries from that table which it can’t account for, so my updates are wiped.
I did race it and win: my rackd snuck in while I was updating (several times, fighting against regiond) and my rackd server is now showing up in the controllers.
Except it seems that once the table is wiped of those rows, the rackd server polls again for RPC endpoints and loses the connection…

dang it!

r00ta · 9 December 2023 13:13

hi @alfred-stokespace ,

I suspect this is not going to be easy, unless you tunnel the traffic from the rack to the region with a vpn.

However, if you go for the hardest path, I think you have to:

Open all the ports that MAAS uses (should be the range from 5238 to 5255, but I don’t recall by heart and need to double check).
Use the public URL as the MAAS URL: just re-initialize the region controller.
The hard one: the rack controller will make a request to http://regionip:5240/MAAS/rpc to get all the RPC endpoints, and from what I see the region provides the endpoint information with the private IP. I can’t look much into this right now, but https://github.com/maas/maas/blob/00e818eb2434af668fefed378bc9da36548bcd0c/src/provisioningserver/rpc/clusterservice.py#L1344 should be the starting point to see how we calculate the address for the RPC endpoints. If you manage to use the public IP there, I think it will work.
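As a rough way to check the first step from the rack side, the sketch below tries a TCP connection to each port on the region's public address and reports which ones answer. The host name in the commented usage is hypothetical, and the port range just mirrors the unverified numbers mentioned above; adjust both to your setup.

```python
import socket


def reachable_ports(host: str, ports, timeout: float = 2.0) -> dict:
    """Try a TCP connect to each port; map port -> True if it accepted."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


# Example usage (hypothetical host; range taken from the unverified numbers above):
# status = reachable_ports("region.example.com", range(5238, 5256))
# for port, ok in sorted(status.items()):
#     print(port, "open" if ok else "closed or filtered")
```

A port showing as closed here only means nothing accepted the connection from this vantage point; it cannot tell a security group rule apart from a service that is not listening.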

alfred-stokespace · 11 December 2023 15:12

Thanks @r00ta
Yes… I see that code. I naively assumed I could just edit the clusterservice.py file locally at the snap location… but alas, snaps are read-only file system mounts!
It does seem that those guard rails can be ripped off… but I think it’s pretty clear that this isn’t going to be a production solution to my problem at this point.

My next approach is going to be to install a second region controller but put it in the other network. That means that the 2nd region controller will use the same database as the 1st but will have network addresses that are local to the machines I’m trying to provision.

I have a concern that the two region controllers may need/want to communicate with each other?
Will see, I guess.

One thing I think I see here though is a feature request to allow configuring an alternate ip address for cases like this (ie. Hybrid cloud, some on-prem and some cloud services)

r00ta · 11 December 2023 15:48

I’m actually struggling a bit to see why you are trying to keep the region “far away” from the racks. Since there is traffic flowing from the regions to the racks (and images can be large), it’s better to keep them within the same datacenter.

If you could argue why this is something desirable, it would help us give more value to this feature request.

alfred-stokespace · 11 December 2023 16:11

@r00ta

Our company utilizes a hybrid network. We have some on-prem networks and some cloud networks.
We favor cloud networks because we can pay AWS to solve problems they are good at solving.

That preference means I don’t want to host a PostgreSQL db on-prem; I’d prefer that be RDS. And I’d prefer having AWS ALBs serving SSL termination for the MAAS UI.
And at that point, why not use an ec2 instance to host one region server to be the UI presence for the company.

Now I get the concern about large images, but I have gigabit VPC links between on-prem and cloud, and I’m not expecting the images to be constantly changing.

I understand that the rack server uses TFTP, and that TFTP really wants a pretty simple network architecture since it doesn’t work well over something like NAT (i.e. the server sends its own UDP packets back to the client). So I’ve accepted that some things need to be local to the clients. Okay, so I have a couple of on-prem rack servers that are on the same network, and TFTP works fine.

So, I’d like to have some things cloud and some things on-prem.

(2nd use case, less important than the first: I’d like to allow my engineers to do host provisioning from pop-up locations, their homes let’s say, that are convenient for them, rather than forcing them to locate themselves where the network allows for it. In this case, asking them to run a rack server that reaches out to a protected central region server seems desirable.)

alfred-stokespace · 11 December 2023 20:56

Continuing on with the journey;

I made my PostgreSQL instance available to the remote network. I was able to install region+rack on a spare metal host. I used the AWS+ALB MAAS URL to perform the init.

The local(region+rack) can’t communicate with the cloud(region+rack) and that makes sense to me, we already established that remote can’t see the public ips assigned to the AWS EC2 instance so I expect it when I see a warning in the UI “50% connected to region controllers.”

I’ve confirmed that tftp is running on the local(region+rack) server and rackd.log is showing that it can at least connect to local(region).

However,… when I first attempted a net-boot I was greeted with a secure-boot failure on the test host.
This is odd as this wasn’t an issue when I was doing a trivial all-local test of MAAS. I’m going to guess that you sign your pxelinux.0 img with a key that matches back to source host or something? Not sure what that’s about…

So…, after I disable secure boot in the bios, I try my netboot again, same host, and I get the following quickly flashed two-three times
"Downloading NBP file…

NBP file downloaded successfully."
But this seems to fail, as the disk fall-through picks up from there and I get back into initial OS.
And I don’t see the machine added to my list in the UI.
– edit –
I see these in the local log,

2023-12-11 13:33:17 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by 172.20.0.244
2023-12-11 13:33:18 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by 172.20.0.244
2023-12-11 13:33:18 provisioningserver.rackdservices.tftp: [info] pxelinux.0 requested by 172.20.0.244

Thoughts?

maristelsksk · 12 December 2023 04:14

You can try configuring the public IP directly on your AWS host so that the region can discover it. That’s what I’ve done before to make sure that the rack controller in my DC can communicate properly with the AWS region controller.

alfred-stokespace · 12 December 2023 15:21

@maristelsksk Oh! please explain how you did this?

I’ve associated public ip to the ec2 instance but ubuntu has no knowledge of that public ip address. So, when you look at ifconfig all you see is the local private ip.

How are you getting the region server to see the public ip?

alfred-stokespace · 12 December 2023 18:37

Perhaps what the commenter was referring to was this…
nic - Can a single network card have 2 IP addresses? - Server Fault

My first attempt was to just change the IP address on the interface to the public IP address… yooow! Don’t do that! That immediately broke networking to the instance, and AWS became confused about its status as a VM.

So, then I did this

ip addr add MY.PUB.LIC.IP dev ens5 label ens5:1

and wouldn’t you know it! The maasserver_regioncontrollerprocessendpoint table now has two sets of entries in it, one is the local/private ip and the other is the public ip.
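A quick way to confirm that the OS itself now owns the address (independently of what lands in the MAAS table) is to try binding a socket to it. This is only a sketch: the function name is made up for illustration, and it assumes the kernel is not configured to allow non-local binds.

```python
import socket


def is_local_address(ip: str) -> bool:
    """Return True if a UDP socket can bind to this IPv4 address on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            # Port 0 lets the OS pick any free port; only the address matters.
            sock.bind((ip, 0))
            return True
        except OSError:
            # Typically "cannot assign requested address" when no interface has it.
            return False


# Example usage: replace with the address added via `ip addr add`.
# print(is_local_address("MY.PUB.LIC.IP"))
```

Before the alias is added this should return False for the public IP, and True afterwards.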

Not a solution, … yet. I need to test if a rackd can actually reach this thing from another network.

alfred-stokespace · 12 December 2023 20:44

I was able to install rack on the second network.
The logs seem happy.

This might be the solution to the basic problem I was having.

I tried booting a machine off the rack install and I’m currently battling with some issue related to the boot image not making secure boot happy.

r00ta · 13 December 2023 11:28

Are you trying to deploy Ubuntu or RHEL?

alfred-stokespace · 13 December 2023 23:23

Ubuntu.

I got past this for now by using my newly minted rack server as the DHCP server, and all my secure boot troubles disappeared.

I do want to revisit this. Ideally I want my Mikrotik router to be doing dhcp and providing next-server as the rack server.

So far that fails first for a secure boot reason, and then (if you disable secure boot) the PXE boot process reports that it downloaded an image but then fails and spins on downloading. I could see logs in rackd.log telling me that the host was asking for pxelinux.0.

I’m satisfied with this thread being complete at this point. I’ll have to investigate the boot problems separately.