Skip to content

Trouble with IPv6 in a KVM guest running the 3.13 kernel

Some time ago, I wrote about two problems with the 3.13 kernel shipping with Ubuntu 14.04 LTS Trusty Tahr: One turned out to be a problem with KSM on NUMA machines acting as Linux KVM hosts and was fixed in later releases of the 3.13 kernel. The other one affected IPv6 routing between virtual machines on the same host. Finally, I figured out the cause of the second problem and how it can be solved.

I use two different kinds of network setups for Linux KVM hosts: For virtual-machine servers in our own network, the virtual machines get direct bridged access to the network (actually I use OpenVSwitch on the VM hosts for bridging specific VLANs, but this is just a technical detail). For this kind of setup, everything works fine, even when using the 3.13 kernel. However, we also have some VM hosts that are actually not in our own network, but are hosted in various data centers. For these VM hosts, I use a routed network configuration. This means that all traffic coming from and going to the virtual machines is routed by the VM host. On layer 2 (Ethernet), the virtual machines only see the VM host and the hosting provider's router only sees the physical machine.

This kind of setup has two advantages: First, it always works, even if the hosting provider expects to only see a single, well-known MAC address (which might be desirable for security reasons). Second, the VM host can act as a firewall, only allowing specific traffic to and from the outside world. In fact, the VM host can also act as a router between different virtual machines, thus protecting them from each other should one be compromised.

The problems with IPv6 only appear when using this kind of setup, where the Linux KVM host acts as a router, not a bridge. The symptoms are that IPv6 packets between two virtual machines are occasionally dropped, while communication with the VM host and the outside world continues to work fine. This is caused by the neigbor-discovery mechanism in IPv6. From the perspective of the VM host, all virtual machines are in the same network. Therefore, it sends an ICMPv6 redirect message in order to indicate that the VM should contact the other VM directly. However, this does not work because the network setup only allows traffic between the VM host and individual virtual machines, but no traffic between two virtual machines (otherwise it could not act as a firewall). Therefore, the neighbor-discovery mechanism determines the other VM to be not available (it should be on the same network but does not answer). After some time, the entry in the neighbor table (that you can inspect with ip neigh show) will expire and communication will work again for a short time, until the next redirect message is received and the same story starts again.

There are two possible solutions to this: The proper one would be to use an individual interface for each guest on the VM host. In this case, the VM host would not expect the virtual machines to be on the same network and thus stop sending redirect packets. Unfortunately, this makes the setup more complex and - if using a separate /64 for each interface - needs a lot of address space. The simpler albeit sketchy solution is to prevent the redict messages from having any effect. For IPv4, one could disable the sending of redirect messages through the sysctl option net.ipv4.conf.<interface>.send_redirects. For IPv6 however, this option is not available. So one could either use an iptables rule on the OUTPUT chain for blocking those packets or simply configure the KVM guests to ignore such packets. I chose the latter approach and added

# IPv6 redirects cause problems because of our routing scheme.
net.ipv6.conf.default.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0

to /etc/sysctl.conf in all affected virtual machines.

I do not know, why this behavior changed with kernel 3.13. One would expect the same problem to appear with older kernel versions, but I guess there must have been some change in the details of how NDP and redirect messages are handled.

Addendum (2014-11-02):

Adding the suggested options to sysctl.conf does not seem to fix the problem completely. For some reasons, an individual network interface can still have this setting enabled. Therefore, I now added the following line to the IPv6 configuration of the affected interface in /etc/network/interfaces:

        post-up sysctl net.ipv6.conf.$IFACE.accept_redirects=0

This finally fixes it, even if the other options are not added to sysctl.conf.

Update on KVM problems with kernel 3.13

A few weeks ago, I wrote about problems with kernel 3.13 on Ubuntu 12.04 LTS  and 14.04 LTS.

Most likely, the problem that caused the excessive CPU load and occassional high network latencey has been fixed by now and the fix is going to be included in version 3.13.0-33 of the kernel package. I experienced this problem on a multi-processor machine, so it is probable that this was the problem with KSM and NUMA that has been fixed.

I am not sure, whether the problems that I had  with IPv6 connectivity are also solved by this fix: I had experienced those problems on a single-processor (but multi-core) machine, so it does not sound like a NUMA problem to me.

Anyhow, I will give the 3.13 kernel another try when the updated version is released. For the moment, I have migrated all server machines back to the 3.2 kernel, because the 3.5 kernel's end-of-life is soon and the 3.13 kernel has not been ready for production use yet. I do not expect to have considerable gains by using a newer kernel version on the servers anyway, so for the moment, the 3.2 kernel is a good option.

Linux KVM Problems with Ubuntu 14.04 LTS / Kernel 3.13.0-30

A few days ago I upgraded a virtual-machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). First, I thought that everything was working fine.

However, a short time later I noticed funny problems with the network connectivtity, particularly (but not only) affecting Windows guests. Occasionally, ICMP echo requests would only be answered with an enormous delay (seconds) or sometimes not even be answered at all. TCP connections to guests would stall very often. At the same time the load on the host system would be high even though the CPU usage would not be extremely heavy.

After I downgraded the virtual-machine host back to Ubuntu 12.04 LTS (and consequently to kernel 3.5) this problems disappeared immediately.

It seems like this is a bug related to the 3.13 kernel shipped with Ubuntu 14.04 LTS. There is a bug report on Launchpad and a discussion on Server Fault. It might be that the other problems that I experienced with the backported 3.13 kernel are related to this issue.

For the moment I will keep our virtual-machine hosts on Ubuntu 12.04 LTS and kernel 3.5, until the problems with the 3.13 kernel have been sorted out.

Trouble after installing linux-generic-lts-trusty in Ubuntu 12.04 LTS

Yesterday I updated a lot of computers (hosts as well as virtual machines) running Ubuntu 12.04 LTS (Precise Pangolin) to the backported version of the 3.13 kernel. This kernel is provided by the linux-image-generic-lts-trusty package which is installed (together with the linux-headers-generic-lts-trusty package) when installing linux-generic-lts-trusty. By installing the backported kernel (before the update all Ubuntu 12.04 LTS systems where running on the 3.5 kernel provided by linux-generic-lts-quantal) I wanted to increase the uniformity between the Ubuntu 12.04 LTS and Ubuntu 14.04 LTS systems.

After installing the new kernel and rebooting the machines, funny network problems started to happen. For some virtual machines, IPv6 communication between virtual machines running on the same VM host became very unreliable. For other virtual machines, I experienced occassional huge delays (up to several seconds) for IPv4 packets.

After testing around for a few hours (at the same time I had upgraded a virtual-machine host to Ubuntu 14.04 LTS and first suspected this upgrade, specifically the new version of OpenVSwitch), I found out that these network problems were indeed caused by the new kernel in the virtual machines. If one of two virtual machines running on the same host had the new kernel running, the problems with IPv6 appeared. If both were running the old kernel version, the problems disappeared. The other problem with the massively delayed IPv4 packets was a bit harder to reproduce. Funnily, it already became much better when I downgraded just one of the virtual machines on the host.

At the current stage (linux-image-generic-lts-quantal-3.13.0-30), there seems to be a massive problem with the IP stack of the kernel. For some reasons, this problems only seem to be triggered if the kernel is running in a (Linux KVM) virtual machine. For now, I downgraded all virtual machines back to the old kernel version.

I have to do some more tests to find out whether these problems are caused by the newer kernel in general or whether they are specifically caused by the backported version. At the moment I only have one virtual machine with Ubuntu 14.04 LTS, so I will have to setup some test VMs to carry out more tests.

Until then, I can only recommend to stay away from the backported 3.13 kernel, at least for virtual machines.

Nagios check_linux_raid missing in Ubuntu 14.04 LTS

I just upgrade a KVM virtual machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). Everything went smoothly except for one problem: The check_linux_raid script is missing in the updated version of the nagios-plugins-standard package.

The nagios-plugins-contrib package seems to contain a script which basically does the same job, but this package has a lot of other plugins that pull tons of additional dependencies, so I did not want to install this package. Luckily, just copying the check_linux_raid script from a system with the older version of Ubuntu worked fine for me.

Migrating the EJBCA database from H2 to PostgreSQL

I recently installed EJBCA for managing our internal public key infrastructure (PKI). Before using EJBCA, I used openssl from the command-line, but this got uncomfortable, in particular for managing certificate revocation lists (CRLs).

Unfortunately, I made a small but significant mistake when setting up EJBCA: I chose to use the default embedded H2 database. While this database for sure could handle the load for our small PKI, it is inconvenient when trying to make backups: The whole application server needs to be stopped in order to ensure consistency of the backups, a solution which is rather impractical. Therefore I wanted to migrate the EJBCA database from H2 to PostgreSQL.

However, H2 and PostgreSQL are quite different, and the SQL dump generated by H2 could not be easily imported into PostgreSQL. After trying various approaches, I luckily found the nice tool SQuirreL SQL, which (besides other things) can copy tables between databases - even databases of different type. Obviously, this will not solve all migration problems, but for my situation it worked quite well.

I documented the whole migration process in my wiki, in case someone else wants to do the same.

Bare-Metal Recovery of Windows Server 2012 R2 using Bacula

I have been using Bacula as our main backup system for years. While Bacula works perfectly for Linux systems, bare-metal recovery (also known as disaster recovery) of Windows systems has been an open issue ever since.

The Bacula manual describes some procedures, but they only apply to systems running an operating system not newer than Windows Server 2003 R2. Even these procedures remain a bit unclear. If you look for solutions that cover Windows Server 2008 and newer versions of Windows, you will only find a few mailing-list posts that discuss using Windows Server Backup in combination with Bacula. However, none of these solutions sound very appealing.

I believe that you do not have a backup unless you tested the restore, I wanted to find out the best way for backing up a Windows system with Bacula. So I spent some time and installed a Windows Server 2012 R2 system in a virtual machine, made a backup with Bacula, and then tried to restore this backup in a new virtual machine. I actually succeeded without using Windows Server Backup or any other third-party tool. It really seems to work with a Bacula-only solution.

I documented the steps I used in the wiki, just in case I might have to restore a Windows System from a Bacula backup in the future. Maybe this guide is useful for you as well.

OpenLDAP Server not listening on IPv6 Socket in Zimbra 8

Recently I have been experiencing a strange with an installation of the Community Edition of Zimbra Collaboration Server 8: Although all services were running, no e-mails were delivered. In the log file /var/log/zimbra.log I found messages like "zimbra amavis[9323]: (09323-01) (!!)TROUBLE in process_request: connect_to_ldap: unable to connect at (eval 111) line 152.".

The strange things about this was, that the OpenLDAP daemon (slapd) was running and answering requests. After restarting Zimbra (/etc/init.d/zimbra restart), the problem disappeared, however it reappeared after the next reboot.

After some time I figured out, that - right after the reboot - slapd was only listening on an IPv4 socket, not on an IPv6 socket. After restarting the OpenLDAP server (ldap stop && ldap start as user zimbra), the problem disappeared again and netstat showed that now slapd was also listening on the IPv6 socket.

In the end I could not figure out, why the OpenLDAP daemon would only listen on IPv4 when started during system boot but would listen on both IPv4 and IPv6 when started later. I was suspecting some problem with name resolution in the early boot process (although both the IPv4 and the IPv6 address were listed in /etc/hosts).

However, I found a work-around for the problem: By setting the local configuration option ldap_bind_url to ldap:/// (zmlocalconfig -e ldap_bind_url=ldap:///) , I could configure OpenLDAP to listen on all local interfaces, which apparently fixed the problem.

RTFM or better don't...

While I am writing about curious bugs, here is another one, although technically it is not really a bug.

When setting up Icinga with mod_gearman, I wondered why service-checks where running on the assigned mod_gearman worker node, but host-checks were running on the main Icinga server and were not distributed using mod_gearman. I checked the configuration again and again, but could not find an error. Also searching the web did not bring much useful information.

The only thing that I could find were hints that do_hostchecks had to set to "yes" in /etc/mod-gearman/module.conf. But according to the mod_gearman documentation, this option was set to "yes" by default.

Well, as it turns out, the flag is set to "no" by default, at least in the version of mod_gearman that is available in the software repositories of Ubuntu 12.04 LTS (Precise Pangolin). By the way, the manual that is distributed in the source archive of mod_gearman 1.2.2 (this is the same version that comes with Ubuntu) says the same, so it is not a thing that was changed recently.

OpenDKIM bug in Zimbra Collaboration Server

Recently I stumbled across a bug in the OpenDKIM configuration of the Zimbra Collaboration Server.

In ZCS 8.0.3 (Community Edition, but I guess the same applies to the Network Edition), the file /opt/zimbra/conf/ specifies the socket that OpenDKIM listens on in the following way:

Socket                %%zimbraInetMode%%:8465@[%%zimbraLocalBindAddress%%]

This results in the following socket address of "inet6:8465@[::1]" in the final file (opendkim.conf). However, the Postfix configuration file /opt/zimbra/postfix/conf/ specifies the socket as "inet:localhost:8465". This leads to Postfix trying to connect to an IPv4 socket, while OpenDKIM is listening on an IPv6 socket, so that the connection cannot be established.

The fix is quite easy: By changing "%%zimbraInetMode%%:8465@[%%zimbraLocalBindAddress%%]" to "inet:8465@[]" in and restarting Zimbra, OpenDKIM can be made to listen on an IPv4 socket, so that Postfix can connect again.

The curious thing is, that this bug has already been reported half a year ago and has supposedly been fixed. However, it seems like this fix was only applied to the 9.0 branch of Zimbra and not to Zimbra 8.0.

Update on KVM Shutdown on Ubuntu

About two years ago, I wrote an article about how to make libvirt on Ubuntu 10.04 LTS to shutdown the virtual machines gracefully, when the host system is shutdown or rebooted.

Now I recently found out, that they implemented a similar approach in Ubuntu 12.04 LTS. The only problem with this is, that the default timeout is too short (30 seconds) for virtual machines running complex services. Therefore, I documented how to change this timeout in my wiki.

Less Trouble with KVM virtio and DHCP

In an earlier blog post I claimed that I was seeing problems with VMs using the virtio driver for networking on an Ubuntu 12.04 LTS KVM host using DHCP.

However, as far as I am concerned, this claim was wrong. I now figured out, that the messages about bad UDP checksums had nothing to do with my problem. I was rather experiencing the problems caused by a configuration that did not list the VLAN network interface (eth0.X) on which the DHCP relay agent received the answers from the DHCP server.

The mean thing is, that switch away from virtio fixed this problem. However, this was not because of the UDP checksums now being right (this was merely a side effect). It fixed the problem, because when not using the virtio driver, the DHCP relay agent would receive the answer packets, even if they were received on a VLAN interface it was not listening to. I can only guess that the implementation for VLAN-tagged interfaces is slightly different when using the virtio driver.

After adding the interface to the list of interfaces used by the DHCP relay agent, the DHCP packets are relayed correctly, even if using the virtio driver. The messages about bad UDP checksums now reappeared in the log file, but obviously this is not causing any trouble.

On the other hand, according to a bug report some users really seem to have problems with DHCP when using the virtio driver. However, this might only affect Ubuntu 12.04 LTS guests but not VMs on a Ubuntu 12.04 LTS host.

Trouble with KVM virtio and DHCP

Lately I experienced a problem with KVM-based virtual machine running a DHCP server and another one running a DHCP relay (for both I use the ISC implementations). The DHCP relay was complaining about "bad udp checksums". Using tcpdump and wireshark I quickly found out, that the software was right and the UDP checksums were in fact wrong. After some searching, I found a bug report, that basically described the same problem.

Although I cannot verify this, I think the problem might be related to the fact that I recently upgraded the host machine from Ubuntu 10.04 LTS (Lucid) to Ubuntu 12.04 LTS (Precise). As a workaround, I deactivated the use of the "virtio" support for the network interface in both virtual machines, which seems to fix the problem, because then the UDP checksums are correct.

However, when I performed the same change for a virtual machine still running on an Ubuntu 10.04 LTS host, this actually caused a problem: If VLAN interfaces are used inside the virtual machine, the normal non-virtio driver will screw things up on a Ubuntu 10.04 LTS host.

Long story short: For virtual machines running on a Ubuntu 10.04 LTS host you should use and for a Ubuntu 12.04 LTS you should avoid virtio networking.

Update [2012-07-08]: It seems like the conclusions I draw in this article are actually wrong. Therefore I posted an update clarifying the situation.

Trouble with globally installed Firefox extensions after software upgrade in Ubuntu

Some time ago Ubuntu 10.04 LTS received a Firefox update from the 3.6 branch to the more recent versions (10.0+).

After that upgrade, Firefox on one of my Ubuntu systems suddenly appeared in English instead of the correct locale (German in my case). I first thought that this might be a problem with some localization packages not being installed correctly. However, the problem persisted after upgrading the system to Ubuntu 12.04 LTS.

Some globally extensions (in particular the language pack) showed up in an old version in the add-ons list, although the newest version was installed. Finally I found out, that Firefox looks for the extensions in /usr/lib/firefox/extensions. The new language packs however had been installed in /usr/lib/firefox-addons/extensions. On other systems, /usr/lib/firefox/extensions is a symbol link to /usr/lib/firefox-addons/extensions. In my case however the directory existed and contained the files from the old versions of the language packs.

For some reasons, the old language packs (which had different package names) had not been removed and thus the Firefox upgrade did not place the symbol link (because the directory was not empty). After manually removing the old versions of the language packs, deleting the directory and reinstalling Firefox, the symbol link was created automatically and suddenly the globally installed Firefox add-ons worked again.

KVM and Graceful Shutdown on Ubuntu

For quite some time, I have been trying to figure out, how to gracefully shutdown the KVM-based virtual machines running on a Ubuntu 10.04 LTS (Lucid Lynx) host system. This problem consists of two parts: First you have to make the virtual machines support the shutdown event from libvirt and second you have to call the shutdown action for each virtual machine on system shutdown.

The first part is very easy for Linux VMs and also not too hard for Windows VMs. I described the necessary steps in my wiki

The seconds part is harder to accomplish: On Ubuntu 8.04 LTS (Hardy Heron) I just modified the /etc/init.d/libvirt-bin script to call a Python script in the stop action. This solution was not perfect, as it meant that the virtual machines were also shutdown, when libvirtd was just restarted, however it was a quick and easy solution.

For Ubuntu 10.04, the init script has been converted to an Upstart job. So the easiest way was to create a upstart job that is starting on the stopping libvirt-bin event. However, this did not solve the problem, because the system powered off or rebooted before the shutdown of the virtual machines was finished. As it turns out, Ubuntu 10.04 uses an odd combination of Upstart jobs and traditional init scripts. This leads to a situation, where /etc/init.d/halt or /etc/init.d/reboot are called, before all upstart jobs have stopped, when one of the upstart jobs needs a significant amount of time to stop. This can be solved by adding an init script, than runs before the halt or reboot scripts and waits for the respective Upstart job to finish. In fact, it is best to run this script before the sendsigs script to avoid processes started by one of the upstart jobs to receive a SIGKILL.

I added the complete scripts and configuration files needed for this feature to my wiki. In fact, this solution also ensures, that the virtual machines are only shutdown if libvirt is stopped because of a runlevel change. Thus, the libvirt-bin package can now be upgraded without resulting in a restart of the VMs.

For me, automatically shutting down the virtual machines is very important. The KVM hosts I manage are connected to an uninterruptible power supply with limited battery time. Although in the past years I remember only a single time, the host systems were shutdown because the battery was nearly empty (most power interrupts are very short), I want to make sure that all virtual machines are in a safe, consistent state, when the power finally goes off. So I hope that the scripts in the wiki are also helpful to other people, having the same problem.