Tailscale on blog.iankulin.com

Tailscale keys expire

Tue, 24 Oct 2023 00:00:00 +0000

I have an Ansible playbook I run each weekend to do all the apt updates. As well as keeping everything up to date, it’s a good check-in that everything’s alive and working as expected. I have Uptime Kuma checking that the services are alive and that nothing is running out of disk or memory, so there shouldn’t be any drama, right?

This weekend, three instances (two remote, one local) timed out with “unreachable”.

Since Ansible is effectively ssh-ing in, the obvious next step was to try that from the terminal.

vm100-dockhost is the “magic DNS” name for this machine. One of the cool things Tailscale does is to allow these sorts of names. I use them so much, I’ve forgotten all their IP addresses. When I look it up and try with the local IP address for this machine, it works fine.

Since it seems like a Tailscale problem, I tried turning it off and on again with sudo tailscale down and sudo tailscale up. When it came up, it printed the URL to re-authenticate - so something’s happened…

It turns out that Tailscale keys expire for security reasons - by default every 180 days. Once the key is expired, you can’t access that machine via the Tailnet. Obviously, this is going to be a problem if you have a remote site and the key expires. So how can we prevent it from happening?

My first idea was to use the Tailscale CLI to do the re-authentication on each machine before it expires. And handily, there is a command for this:

tailscale up --force-reauth

But there’s a small catch (mentioned in the docs, or in the CLI if you try it): if you are ssh’d in over Tailscale when you run this, it drops the ssh link. You’ll never see the URL you need to re-authorise, so you’ve lost access to that machine.

If a key has expired, it is possible to remotely reauthorise it from the machines admin page for a short period, to allow someone with local access to reauthorise it properly. If you don’t have local access to it, you’re in trouble if you discover this after it’s expired. I guess it would be possible to write a script to run the tailscale up on the remote machine, capture the output and send it to me, but that’s starting to sound like more work than I want to do.
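For what it’s worth, the “capture the output” half of that script idea could be sketched roughly like this in Python. It is untested against a live re-authentication, and it assumes the login link shows up as a plain https:// URL somewhere in the text that tailscale up prints. The function names are hypothetical, and how the message gets delivered (email, webhook, whatever) is left out entirely.

```python
import re

# Assumption: the login link appears as an ordinary https:// URL in the output.
URL_PATTERN = re.compile(r"https://[^\s\"'<>]+")


def extract_auth_url(output):
    """Return the first https:// URL found in the captured text, or None."""
    match = URL_PATTERN.search(output)
    return match.group(0) if match else None


def build_message(hostname, output):
    """Build a one-line note that could be sent on to whoever can log in."""
    url = extract_auth_url(output)
    if url is None:
        return f"{hostname}: no login URL found in tailscale output"
    return f"{hostname}: re-authenticate at {url}"
```

The command itself would still have to be started in a way that survives the ssh session dropping (from cron, say), with its output saved to a file that gets passed to build_message afterwards.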

Avoiding the problem

If you want to avoid the problem of Tailscale keys expiring on remote systems, it’s possible to disable key expiry so they never expire. This option is in the menu for each machine on the machines admin page.

I guess another way of avoiding this problem, if it’s possible, would be to visit your remote sites every six months and do the forced re-authentication to reset the expiry. For my setup of the remote backup sites that’s a reasonable plan.

One slightly annoying thing is that it’s not easy to see the expiry date of each Tailscale instance. I would have thought it would appear on the machines admin page, or in the CLI with tailscale status. When I was searching for an answer, I saw that there is an open GitHub issue for it, and there’s been an update to the JSON version of the tailscale status command that includes the key expiry date.
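As a rough sketch of how that JSON output could be used, the Python below works out how many days each node has left. The field names (Self, Peer, HostName, KeyExpiry) are what I’d expect to find in the tailscale status --json output rather than something confirmed here, so check them against your own version first. Nodes without a KeyExpiry value (for example with expiry disabled) are skipped.

```python
import json
import re
import subprocess
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse an RFC 3339 timestamp such as 2024-04-21T00:00:00Z."""
    text = re.sub(r"\.\d+", "", text)       # drop fractional seconds
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"         # older Pythons reject a bare Z
    return datetime.fromisoformat(text)


def days_until_expiry(status, now=None):
    """Map hostname -> whole days until key expiry, for nodes that report one."""
    now = now or datetime.now(timezone.utc)
    nodes = [status.get("Self") or {}]
    nodes += list((status.get("Peer") or {}).values())
    remaining = {}
    for node in nodes:
        expiry = node.get("KeyExpiry")      # assumed field name
        if not expiry:
            continue                        # no expiry reported for this node
        name = node.get("HostName", "unknown")
        remaining[name] = (parse_timestamp(expiry) - now).days
    return remaining


def read_status():
    """Run the CLI and return its JSON output as a dict."""
    result = subprocess.run(
        ["tailscale", "status", "--json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Calling days_until_expiry(read_status()) on any machine in the Tailnet should then give a dictionary that a weekly job could scan for anything under, say, 14 days.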

Getting Tailscale working in LXC containers

Wed, 18 Oct 2023 00:00:00 +0000

I’ve taken to running lots of my services in LXC containers under Proxmox. I like the feeling of installing in a VM, but it’s lightweight. I like the backups, I like things being isolated from each other, I like moving them around between machines easily. I’m just a big LXC lover at the moment.

I’m also a Tailscale lover, and the generous number of nodes in the free tier means I now just routinely install them in my VMs and containers without a thought.

There is an issue with unprivileged LXC containers and Tailscale though. Unprivileged containers have less access to the host system’s internals, and are therefore a bit safer, but part of that reduced access includes some of the networking stuff that Tailscale needs. If you try to install Tailscale, it will look fine, until you get to the tailscale up command, at which point it will say something like:

failed to connect to local tailscaled (which appears to be running as tailscaled, pid 3121). Got error: 503 Service Unavailable: no backend

There is an easy way to fix this, documented in a Tailscale how-to guide. Basically you need to stop the container and edit the LXC conf file. These are named by the container number. My container is 354, so the conf file is /etc/pve/lxc/354.conf

Add the lines:

lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file

The first line allows the container to access the character device with major number 10 and minor number 200, which is the TUN/TAP device (commonly used for VPN and VM networking). The second bind-mounts the host’s /dev/net/tun into the container, creating the file there if needed. Together they let the container work with TUN/TAP devices for networking, which is what Tailscale needs.

Start the container again, redo your tailscale up, and you should be in business.
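If there are a few containers to fix, the edit could be scripted. This is a minimal Python sketch with a made-up function name. It only appends the two lines when they are missing, so running it twice is harmless. It simply appends to the end of the text, which assumes the conf file has no snapshot sections after the main block.

```python
TUN_LINES = [
    "lxc.cgroup2.devices.allow: c 10:200 rwm",
    "lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file",
]


def add_tun_lines(conf_text):
    """Return conf_text with any missing TUN lines appended."""
    existing = {line.strip() for line in conf_text.splitlines()}
    missing = [line for line in TUN_LINES if line not in existing]
    if not missing:
        return conf_text                     # nothing to do
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"                    # keep one entry per line
    return conf_text + "\n".join(missing) + "\n"
```

With the container stopped, you would read /etc/pve/lxc/354.conf, pass the text through add_tun_lines, and write the result back.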

Solved DNS Issues - Proxmox, LXC, Ubuntu, Tailscale

Fri, 06 Oct 2023 00:00:00 +0000

I’ve picked up a new TP-Link WAP with Omada, so I wanted to spin up an Ubuntu 20.04 LXC to run the controller software in, and ended up spending a couple of hours figuring out why things were not working.

The initial problem was I was having connectivity issues pulling down the updates for all the packages required. I went down a bit of a tangent because I installed an apt cache the other day, so I was looking for problems there. Eventually I narrowed it down to DNS not working and started A/B testing like this:

A more seasoned sysadmin probably would have been looking at the /etc/resolv.conf a bit earlier where the glaring hint was. I’ll get to that in a second, but first a bit about my setup.

I’m running Proxmox 8.0.4 on one of my HP G2 800 Minis (love these little power-frugal gems) and I use Tailscale to tie all my network (my homelab here, and two remote locations) together. The Tailscale version on this node is 1.48.1

You can see in the table above that an LXC using the Ubuntu 20.04 template had no domain name resolution, but the Debian 12 one (and the Debian 11 one I tried earlier) did. The /etc/resolv.conf on the Debian containers looked like this:

nameserver 192.168.100.1

And on the Ubuntu container:

# --- BEGIN PVE ---
search tailaf96a.ts.net
nameserver 100.100.100.100
# --- END PVE ---

192.168.100.1 is my local DNS, which is provided by DHCP, but clearly Ubuntu is not using that. The PVE comments tell me it’s Proxmox messing with my container, and that’s the Tailscale DNS server number in there. The container does not have a route to 100.100.100.100, so that DNS is not going to be able to resolve anything.

So, that’s a bit weird, but easily fixed by just editing this back to set the nameserver to 192.168.100.1 right? Well, yes - if you do that, it works, but then as soon as the container is rebooted, the Tailnet DNS gets written back in. Those blocky PVE comments are probably part of the automated system for doing that. So, what’s going on here?

There are two screens for network configuration when you’re creating an LXC container in the Proxmox GUI.

There’s no option in the GUI to just say “Use the DNS settings provided by the DHCP server”, although as we’ll see later, there is a workaround for this.

I’d been leaving the DNS domain set to “use host settings”, so you might reasonably wonder what the Proxmox node’s /etc/resolv.conf looks like:

# resolv.conf(5) file generated by tailscale
# For more info, see https://tailscale.com/s/resolvconf-overwrite
# DO NOT EDIT THIS FILE BY HAND -- CHANGES WILL BE OVERWRITTEN

nameserver 100.100.100.100
search tailaf96a.ts.net local

So actually, although I was thinking there must be some bug with Ubuntu since Debian was working how I expected, it’s the other way around - Ubuntu and Proxmox are working together to do exactly what the settings have told them to: use the host settings. And actually, the Debian containers are not working correctly (although they were working how I expected). The process of Proxmox making these types of changes is documented in the Admin Guide. I’d actually never seen that guide till today (although there is a large “Documentation” button in the top right of the web GUI), but it looks pretty great so I’ll be revisiting it.

Solution 1

The first solution is just to specify the DNS address in the GUI - then our container works exactly as the PVE developers intended. A slight downside is that if I change the network configuration in future and update the DNS address in the DHCP server (which is the logical way to do that) then it won’t update for this container and domain name resolution will stop working for it.

If I do that, the /etc/resolv.conf looks like this:

# --- BEGIN PVE ---
search tailaf96a.ts.net
nameserver 192.168.100.1
# --- END PVE ---

And it all works fine.

Solution 2

This post on the Proxmox Forums led me to a second solution. It’s possible to stop Proxmox from adding the host settings by creating a little signal file inside the container with

touch /etc/.pve-ignore.resolv.conf

When Proxmox sees that, it won’t mess with the /etc/resolv.conf file, so if that’s been edited to:

nameserver 192.168.100.1

It will be left alone, and things will work fine. This is not quite what I’d like - I’d really prefer it picks everything up from DHCP, but I don’t know enough about how that works in Linux to fix it, yet.
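To spot this situation quickly from inside a container, a small check of /etc/resolv.conf could be sketched like this in Python. The function names are made up for illustration. The check relies on 100.100.100.100 sitting inside the 100.64.0.0/10 shared address range that Tailscale addresses are drawn from, so a nameserver in that range on a machine without Tailscale running is a hint that host settings have leaked in.

```python
import ipaddress

# 100.64.0.0/10 is the shared (CGNAT) range; 100.100.100.100 falls inside it.
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split("#", 1)[0].split()   # ignore comments
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def in_tailscale_range(address):
    """True if the address is in the range Tailscale addresses come from."""
    try:
        return ipaddress.ip_address(address) in CGNAT_RANGE
    except ValueError:
        return False                            # not a valid IP address
```

Fed the Ubuntu container’s file from earlier, nameservers returns just 100.100.100.100, and in_tailscale_range flags it, whereas 192.168.100.1 passes.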

Proxmox 8.0 Install

Sun, 23 Jul 2023 00:00:00 +0000

I’m normally an x.1 release type of sysadmin, but the increasing temptation of installing Proxmox 8.0 while I’ve got some time off, and the fact that I’ve got a cluster so I can just move the VMs around, all adds up to thinking I’ll do that today.

Here’s how my system works. It consists of three HP-800 mini G2s. pve-prod1 is a bit fancier - i7 6700T and 32GB, the other two are i5 6500T and 16GB. The production VMs use the local SSD but backups go to the NAS. All the machines are currently running Proxmox 7.4. They are not clustered in the proper sense - I don’t need high availability, and I don’t want to run them all the time. pve-prod1 runs 24/7 and I just power up pve-dev1 when I’m working on something.

The intention is that although I’m not on high availability, I can quickly come back from a machine failure by powering pve-prod2 up and restoring from the latest VM backup from the NAS. pve-prod1 does not have a full load yet (I’m slowly cancelling cloud services and moving them in-house) but once it does, I’d have the capacity to fully replace it by sharing any guests between pve-prod2 and pve-dev1.

Migration plan

Currently pve-prod1 is only running two guests, jellyfin, and a docker host with a collection of smallish services. The plan is to move those to pve-prod2, check everything is working, then install the new Proxmox 8 onto pve-prod1. Apart from giving me the opportunity to do that, it’s a good test of the plan for recovering from a pve-prod1 failure. I’ll live off it for a few days to ensure that it’s a viable process.

A small hitch with this is that the RAM in pve-prod1 cost me $100, and I didn’t want to not use it, so I created the jellyfin VM with 16GB RAM. It’s a simple matter to stop it, give it less, and restart it - except it seems to be using it all.

You can see from this, I tried shutting it down and restarting - thinking that the memory use might climb up slowly as the app was used, but it just went straight back to 15GB. In a way, I approve of a VM using the memory I’ve given it - presumably it is caching or something. Jellyfin should certainly be able to run on a machine with much less memory, so I suppose I’ll stop it, back it up, and try it in a smaller VM.

Yep, that works fine. And I can’t notice any difference in the app performance. So I stopped it, backed it up, and restored onto prod2. And immediately bumped into a couple of problems when I tried to start it.

There were two hardware incompatibilities - the first was that on prod1 I had passed through the GPU from the host (in an unsuccessful attempt to use Quick Sync hardware transcoding for video). I don’t need that, so that gets deleted out of the ‘hardware’ for the VM.

And the second was that I still had the Debian 11 ISO mounted in the ‘cd-rom’. Lol - the Debian installer specifically tells you to remove this before it reboots. That can be removed exactly as I had done for the GPU pass through, and the VM boots fine, and the app tests out ok.

The first time I ever did this - move a guest VM from one lot of hardware to another, then boot it up and all my apps are working perfectly on their old IP addresses - I was amazed and danced around in excitement. I didn’t dance today, but it is so cool.

Interestingly, it’s decided to use much less RAM now. I caused that increase at the end of the graph by rescanning the media library, then browsing through all the titles so the cover images would have to be loaded - so perhaps it’s the web server caching them all. It’s hard to know for sure without some objective measurements, but I suspect the app was crisper and more responsive than before. In any case, it certainly wasn’t any worse.

Moving the docker host over was straightforward and only took five minutes of downtime as it’s a smaller image. I guess a lot of that time is just my 1Gb network limitation or the spinning disk transfer speed from the NAS - the docker host was 4GB and Jellyfin 14GB.
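As a rough sanity check on the network part of that guess, here is a back-of-envelope calculation in Python. It assumes the link is 1 Gbit/s, that the sizes are decimal gigabytes, and it ignores protocol overhead, compression and disk speed, so it gives a best case rather than a measurement.

```python
def transfer_seconds(size_gb, link_gbit_per_s=1.0):
    """Ideal time to move size_gb (decimal gigabytes) over the link."""
    bits = size_gb * 8e9                    # 1 GB = 8e9 bits
    return bits / (link_gbit_per_s * 1e9)   # link speed in bits per second
```

Under those assumptions transfer_seconds(4) comes to 32 seconds and transfer_seconds(14) to 112 seconds, which suggests the wire alone doesn’t account for all of the downtime.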

Nuke and pave

I try and keep my hosts very clean, so wiping them and starting over is no biggie, but since this node has been up I have installed a cron job for temperature logging. I’ve documented that in a blog post so I’ll be able to recreate it, but this sort of thing is the reason I’m interested in Ansible. Another project while I’ve got some time will be to recreate that on the new machine with Ansible so it’s trivial to restore in future. I pulled the temperature log file down though - because who doesn’t like eighty thousand data points.

There is a published process to upgrade Proxmox from 7.x to 8, so I briefly considered it, but fresh installs are generally less likely to lead to drama, especially this early in the major release cycle. Plus, I keep my installs clean to allow it - this is a freedom allowed by my sysadmin discipline along with the investment in redundant hardware so there’s zero time pressure while I’m doing it.

Run Book for New Proxmox Install

My install process for Proxmox goes something like this:

Flash the ISO onto a USB drive with Balena Etcher
Plug in the USB drive, my bluetooth keyboard/mouse USB, and the screen - I’ve got a special long HDMI cord that reaches from my desk to the servers
Boot up, mashing the boot menu key (F9 on my G2’s)
Follow my nose through the prompts - since this is an existing server, the DHCP serves up the correct IP address
ssh into it to check everything’s fine. Since this IP was already in my known hosts file, I had to go and delete it out
ssh-copy-id to get my ssh keys across
Update the repositories - by default, Proxmox comes set up for use with a subscription. I wish they had a lower tier and I’d buy one since it gives me so much joy - even if it didn’t remove the nags. In the meantime, you can follow the instructions here to set it up to use the non-subscription repositories:
- edit /etc/apt/sources.list to add deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription
- edit /etc/apt/sources.list.d/pve-enterprise.list to comment out the line in there
- and a new one that’s not mentioned on that wiki page, edit /etc/apt/sources.list.d/ceph.list to comment out the line in there. I don’t know where that leaves you if you are using Ceph (which is a cool file system if you’re using high availability) but I’m not, so all good. If you don’t do this, you’ll get errors like E: Failed to fetch https://enterprise.proxmox.com/debian/ceph-quincy/dists/bookworm/InRelease 401 Unauthorized [IP: 103.76.41.50 443] E: The repository 'https://enterprise.proxmox.com/debian/ceph-quincy bookworm InRelease' is not signed.
Run the updates with apt update && apt upgrade
Install the certificate - you need SSL set up for the web interface if you want Chrome to let it save your password, which I do. Also the red insecure message bugs me
- Log into the web interface at https://:8006 - you’ll need to jump through all those hoops to take on the responsibility of opening an unsecured site
- Click on the node, then Certificates

- You can open up that certificate, and copy out the raw certificate, paste it into a text editor and save it somewhere. I drag that into my macOS keychain app. It shows up with a red cross, but if you open it up you can mark it as “always trust”
- We’re not done yet, now back in Chrome, click on the insecure message next to the URL. Go into Site Settings | Insecure Content and change it to Allow
- Almost there - at the top of those settings is a button to clear the cache, do that
- Reload the page. Profit.
Then I install Tailscale
Last of all, add my NAS to the storage. I use NFS. The only trick here is to go into the dropdown of what type of content is on that storage, and select everything

And that’s it. Nice new Proxmox. I’ll leave my production VMs on pve-prod2 for a week, and move all of my dev work over to this machine so it gets some exercise before I upgrade the other machines.
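The repository step in the run book is the fiddliest part to do by hand, so here is a minimal Python sketch of the text edits involved. The function names are hypothetical, and it works on the file contents as strings rather than touching anything under /etc/apt itself. Both helpers are written so they can be run more than once without doubling up.

```python
NO_SUBSCRIPTION_LINE = (
    "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription"
)


def comment_out_deb_lines(text):
    """Prefix every active 'deb ' line with '# ', leaving other lines alone."""
    out = []
    for line in text.splitlines():
        if line.lstrip().startswith("deb "):
            out.append("# " + line)
        else:
            out.append(line)                 # already commented, or blank
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


def ensure_line(text, wanted):
    """Append wanted as a new line unless it is already present."""
    if wanted in (line.strip() for line in text.splitlines()):
        return text
    if text and not text.endswith("\n"):
        text += "\n"
    return text + wanted + "\n"
```

The idea would be to run comment_out_deb_lines over the contents of pve-enterprise.list and ceph.list, and ensure_line(contents, NO_SUBSCRIPTION_LINE) over sources.list, before writing each file back.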

Tailscale

The only small issue I ran into (apart from the Ceph repository) was I couldn’t access the machine via its “magic DNS” Tailscale name. Since it was going to be the same name as a machine in my existing network, I’d thought ahead and deleted the old one out via the Tailscale machines page, but even so, it wouldn’t connect from my laptop.

I assume the old Tailscale IP address was cached somewhere, and fixed it by turning Tailscale off and on again on my laptop.