The KVM server hosting my website went offline last month. Thinking the server might have crashed, I went to Virtualizor, the VPS control panel, to reboot the VPS. It did not solve the problem, so I proceeded with my disaster recovery plan.
The hosting provider, Spartan Host, explained that it was a router bug. They fixed the router after 4 hours, but my server did not come online.
Symptom
To investigate what went wrong with my VPS, I went back to Virtualizor to enable VNC access. Having VNC access is like attaching a monitor and a keyboard to the server. It would allow me to see any error messages printed on the screen and log in to check for configuration errors.
I didn't see any errors through the VNC connection.
Thinking it might be a routing problem, I logged in with username and password, and ran a traceroute.
To my surprise, the traceroute was able to reach Internet destinations.
Moreover, I could SSH into the server again.
Seeing the problem went away, I disabled VNC access in Virtualizor. Then, I pressed the reboot button in Virtualizor, so that the hypervisor would apply the VNC settings; rebooting via SSH would not be effective.
Then, I started a ping to the server from my desktop, and eagerly waited.
One minute, two minutes, …, the server did not come online.
What went wrong again?
I repeated the process, re-enabled VNC access, and saw nothing wrong. I disabled VNC again, and the VPS lost connectivity again. Clearly, there's a correlation between the VNC toggle and network connectivity.
Diagnostics Ⅰ
Virtualizor is known to push feature updates in patch releases that sometimes break things, so I asked whether there had been a Virtualizor update recently, but the answer was no. I couldn't figure out the problem, so I opened a support ticket with Spartan Host, and sent over the Netplan configuration file from this Ubuntu 20.04 server.
network:
  version: 2
  ethernets:
    ens3:
      dhcp4: true
      addresses:
        - 2001:db8:2604:9cc0::1/64
        - 2001:db8:2604:9cc0::80/64:
            lifetime: 0
      routes:
        - to: ::/0
          via: 2001:db8:2604::1
          on-link: true
In this configuration:
- IPv4 address is acquired from DHCP, which is provided by the hypervisor.
- IPv6 is statically configured.
- NIC name ens3 is hard-coded, because it never changes.
I've been using similar Netplan configuration files in several other KVM servers, and never had a problem.
Ryan McCully, the managing director at Spartan Host, performed some tests on my KVM. He found that, if VNC is disabled, the network interface on the VPS seems to be "completely dead", as there's no ARP or any other traffic seen on the hypervisor side. He was also puzzled why VNC would affect the network interface, but offered an explanation of why my VPS had been working for the past few months:
- As mentioned above, setting changes in Virtualizor are applied to the hypervisor only after the reboot button in Virtualizor is pressed.
- Most likely, when I finished the initial setup, I disabled VNC but didn't reboot in Virtualizor.
- In this case, VNC tabs are hidden in Virtualizor, but VNC is still enabled on the hypervisor.
Diagnostics Ⅱ
Ryan spent a few more hours of extensive testing, and was able to reproduce this issue. It turned out that disabling VNC in Virtualizor changes the network interface name.
The KVM hypervisor realizes VNC through an emulated VGA monitor.
Enabling VNC attaches a VGA controller to the virtual server, while disabling VNC detaches it.
To see this effect, we can schedule an lspci command in crontab to run upon reboot, and look at the output file when we regain access.
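One way to set this up is sketched below. It is only an illustration: the function name, the script path in the crontab comment, and the default output path are hypothetical, and it additionally records the interface list, which the original capture did not necessarily do.

```shell
# Sketch of a boot-time hardware snapshot. The function name and the default
# output path are hypothetical.
snapshot_boot_hw() {
    out="${1:-/root/boot-hw.txt}"
    {
        date                # when this boot happened
        lspci               # PCI devices and their addresses
        ip -br link         # interface names, state, and MAC addresses
    } > "$out" 2>&1         # keep errors too, e.g. if lspci is not installed
}

# Hypothetical root crontab line, assuming the function is saved in a script
# that calls it (@reboot is a cron extension available on Ubuntu):
#   @reboot /bin/sh /root/boot-hw.sh
```

Because the output goes to a file, it can be read later through whichever access path still works, whether SSH or the VNC console.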
$ : VNC enabled
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
$ : VNC disabled
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:03.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
More importantly, addition or removal of the VGA controller changes the PCI address of the Ethernet controller. This, in turn, changes the network interface name, because Ubuntu adopts Consistent Network Device Naming, in which the name of a network interface is derived from its PCI address.
VNC | PCI address | interface name |
---|---|---|
enabled | 00:03.0 | ens3 |
disabled | 00:02.0 | ens2 |
Therefore, the "consistent" network device naming scheme is consistent only if the PCI address doesn't change. However, PCI addresses aren't always stable. I've seen PCI addresses change on a dedicated server when I configured PCI bifurcation in BIOS settings. Now I've seen a PCI address change on a KVM virtual server.
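The relationship between the lspci output and the interface name can be illustrated with a few lines of shell. This is a deliberate simplification: the real names are computed by udev from the PCI hotplug slot number, and the sketch assumes that slot number equals the PCI device number, as it appears to on this VPS. The function name `pci_to_ens` is made up for the example.

```shell
# Simplified illustration, not how udev is implemented.
# Assumption: the hotplug slot number equals the PCI device number.
pci_to_ens() {
    addr="${1%.*}"              # drop the function: 00:03.0 -> 00:03
    dev="${addr##*:}"           # keep the device number: 03
    printf 'ens%d\n' "0x$dev"   # device numbers are hexadecimal
}
```

With the addresses from the lspci output, `pci_to_ens 00:03.0` prints `ens3` and `pci_to_ens 00:02.0` prints `ens2`. On a live system, `udevadm test-builtin net_id /sys/class/net/ens3` shows the candidate names udev actually computes for an interface.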
Treatment
My Netplan configuration assumes the Ethernet adapter on the KVM server is always "ens3". Disabling VNC changes the network interface name to "ens2", and Netplan would not bring up an interface that it doesn't know about. This caused the VPS to lose connectivity.
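This kind of failure is quiet: nothing on the console says that a configured interface is absent. A small check like the following sketch could make it visible when run from the VNC console or at boot. The function name and the `SYSFS_NET` override (there only so the check can be exercised against a fake directory) are hypothetical; the interface names to check are passed as arguments rather than parsed from the Netplan file.

```shell
# Report configured interface names that do not exist on this boot.
# SYSFS_NET is a hypothetical override used only for testing.
missing_ifaces() {
    net_dir="${SYSFS_NET:-/sys/class/net}"
    status=0
    for name in "$@"; do
        if [ ! -e "$net_dir/$name" ]; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}
```

On the affected VPS with VNC disabled, `missing_ifaces ens3` would print `missing: ens3` and return a non-zero status.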
To solve this problem, the Netplan configuration should identify the network interface by its MAC address. The MAC address can be considered a stable identifier of the network interface, because it's one of the outputs from Virtualizor's Create VPS API.
Therefore, I changed Netplan configuration to this:
network:
  version: 2
  ethernets:
    uplink:
      match:
        macaddress: 6a:ed:d6:3b:49:f4
      set-name: uplink
      dhcp4: true
      addresses:
        - 2001:db8:2604:9cc0::1/64
        - 2001:db8:2604:9cc0::80/64:
            lifetime: 0
      routes:
        - to: ::/0
          via: 2001:db8:2604::1
          on-link: true
Unless the VPS is deleted and re-created with a different MAC address, this should continue to work. I also renamed the network interface to "uplink", so that I don't need to check whether it's "ens3" or "ens2" when I type commands.
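To fill in the macaddress value, or to confirm which name an interface ended up with after a reboot, the kernel's sysfs tree can be searched by MAC address. The following is a minimal sketch; the function name and the `SYSFS_NET` override (present only so it can be tried against a fake directory) are hypothetical.

```shell
# Print the name of the interface that has the given MAC address.
# SYSFS_NET is a hypothetical override used only for testing.
find_iface_by_mac() {
    mac=$(printf '%s' "$1" | tr 'A-F' 'a-f')   # sysfs reports lowercase hex
    for f in "${SYSFS_NET:-/sys/class/net}"/*/address; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "$mac" ]; then
            basename "$(dirname "$f")"
            return 0
        fi
    done
    return 1
}
```

For a quick look without a script, `ip -br link` lists every interface together with its MAC address.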
Elsewhere
I checked the VNC situation on several other KVM servers that I have.
WebHorizon and Evolution Host both modified Virtualizor such that there isn't an option to disable VNC. This prevents the issue, but increases the risk of my VPS being compromised via VNC.
WebHosting24 kept the VNC option intact in Virtualizor. Disabling VNC would lead to changed network interface names, but my new Netplan configuration works.
SolusVM, the VPS control panel used at VirMach and Nexril, keeps the VGA controller attached at all times. Disabling VNC blocks the VNC port so that nobody can connect to it, but does not affect the KVM server itself. I think this is a better approach.
Acknowledgement
Kudos to Ryan McCully at Spartan Host for helping me hunt down this issue. I wouldn't have anticipated the root cause without his help.