Update on KVM problems with kernel 3.13

A few weeks ago, I wrote about problems with kernel 3.13 on Ubuntu 12.04 LTS and 14.04 LTS.

Most likely, the problem that caused the excessive CPU load and occasional high network latency has been fixed by now, and the fix is going to be included in version 3.13.0-33 of the kernel package. I experienced this problem on a multi-processor machine, so it probably was the problem with KSM and NUMA that has been fixed.

I am not sure whether the problems that I had with IPv6 connectivity are also solved by this fix: I experienced those problems on a single-processor (but multi-core) machine, so it does not sound like a NUMA problem to me.

Anyhow, I will give the 3.13 kernel another try when the updated version is released. For the moment, I have migrated all server machines back to the 3.2 kernel, because the 3.5 kernel will reach its end of life soon and the 3.13 kernel is not ready for production use yet. I do not expect considerable gains from using a newer kernel version on the servers anyway, so for the moment, the 3.2 kernel is a good option.


Linux KVM Problems with Ubuntu 14.04 LTS / Kernel 3.13.0-30

A few days ago I upgraded a virtual-machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). At first, I thought that everything was working fine.

However, a short time later I noticed funny problems with the network connectivity, particularly (but not only) affecting Windows guests. Occasionally, ICMP echo requests would only be answered with an enormous delay (seconds) or sometimes not be answered at all. TCP connections to guests would stall very often. At the same time, the load on the host system would be high even though the CPU usage was not extremely heavy.
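Symptoms like these are easier to compare between kernel versions when they are counted instead of eyeballed. The following is only a minimal sketch, assuming the output of the Linux ping utility (reply lines containing icmp_seq=N and time=X ms) has been captured to a string. The function name summarize_ping and the one-second threshold are illustrative choices, not something taken from the bug reports.

```python
import re

# Matches the sequence number and round-trip time in a typical Linux ping
# reply line, e.g.
# "64 bytes from 192.0.2.10: icmp_seq=3 ttl=64 time=0.412 ms"
_REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+) ms")


def summarize_ping(output, sent, slow_ms=1000.0):
    """Return (lost, slow) for captured ping output.

    lost: sequence numbers in 1..sent that never got a reply
    slow: (sequence number, rtt in ms) for replies at or above slow_ms
    """
    rtts = {}
    for line in output.splitlines():
        match = _REPLY.search(line)
        if match:
            rtts[int(match.group(1))] = float(match.group(2))
    lost = [seq for seq in range(1, sent + 1) if seq not in rtts]
    slow = [(seq, rtt) for seq, rtt in sorted(rtts.items()) if rtt >= slow_ms]
    return lost, slow
```

Running this against captures taken before and after a kernel change gives two short lists (unanswered requests and replies delayed by a second or more) that can be compared directly.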

After I downgraded the virtual-machine host back to Ubuntu 12.04 LTS (and consequently to kernel 3.5), these problems disappeared immediately.

It seems like this is a bug related to the 3.13 kernel shipped with Ubuntu 14.04 LTS. There is a bug report on Launchpad and a discussion on Server Fault. It might be that the other problems that I experienced with the backported 3.13 kernel are related to this issue.

For the moment I will keep our virtual-machine hosts on Ubuntu 12.04 LTS and kernel 3.5, until the problems with the 3.13 kernel have been sorted out.

Trouble after installing linux-generic-lts-trusty in Ubuntu 12.04 LTS

Yesterday I updated a lot of computers (hosts as well as virtual machines) running Ubuntu 12.04 LTS (Precise Pangolin) to the backported version of the 3.13 kernel. This kernel is provided by the linux-image-generic-lts-trusty package, which is installed (together with the linux-headers-generic-lts-trusty package) when installing linux-generic-lts-trusty. Before the update, all Ubuntu 12.04 LTS systems were running the 3.5 kernel provided by linux-generic-lts-quantal. By installing the backported kernel, I wanted to increase the uniformity between the Ubuntu 12.04 LTS and Ubuntu 14.04 LTS systems.

After installing the new kernel and rebooting the machines, funny network problems started to happen. For some virtual machines, IPv6 communication between virtual machines running on the same VM host became very unreliable. For other virtual machines, I experienced occasional huge delays (up to several seconds) for IPv4 packets.

After testing for a few hours (at the same time I had upgraded a virtual-machine host to Ubuntu 14.04 LTS and first suspected this upgrade, specifically the new version of Open vSwitch), I found out that these network problems were indeed caused by the new kernel in the virtual machines. If one of two virtual machines running on the same host had the new kernel running, the problems with IPv6 appeared. If both were running the old kernel version, the problems disappeared. The other problem with the massively delayed IPv4 packets was a bit harder to reproduce. Funnily, it already became much better when I downgraded just one of the virtual machines on the host.

At the current stage (linux-image-generic-lts-trusty-3.13.0-30), there seems to be a massive problem with the IP stack of the kernel. For some reason, these problems only seem to be triggered if the kernel is running in a (Linux KVM) virtual machine. For now, I downgraded all virtual machines back to the old kernel version.

I have to do some more tests to find out whether these problems are caused by the newer kernel in general or whether they are specifically caused by the backported version. At the moment I only have one virtual machine with Ubuntu 14.04 LTS, so I will have to set up some test VMs to carry out more tests.

Until then, I can only recommend staying away from the backported 3.13 kernel, at least for virtual machines.

Nagios check_linux_raid missing in Ubuntu 14.04 LTS

I just upgraded a KVM virtual machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). Everything went smoothly except for one problem: The check_linux_raid script is missing in the updated version of the nagios-plugins-standard package.

The nagios-plugins-contrib package seems to contain a script which basically does the same job, but this package has a lot of other plugins that pull tons of additional dependencies, so I did not want to install this package. Luckily, just copying the check_linux_raid script from a system with the older version of Ubuntu worked fine for me.

Windows Server 2012 R2 Windows Update Error 80072F8F

After installing the KB2919355 update, Windows Update would always present error 80072F8F when checking for updates.

Now, one might assume that this is the problem with WSUS 3.x that you can read about everywhere. However, in my case it was not. The WSUS server was using SSL/TLS, but it was running on Windows Server 2012 R2 as well. I looked into this problem for many hours and was misled by two things: First, if you search for this problem on the Internet, there are so many articles talking about the well-known problem with old WSUS servers that you hardly find anything else. Second, I also could not open the WSUS site (or any other SSL-enabled site) in Internet Explorer on the affected machines.

I still do not know where the problem with Internet Explorer comes from - it might well have existed from the beginning and uninstalling KB2919355 did not fix it. The problem with WSUS however was indeed caused by KB2919355...

The certificate used by the WSUS site is signed by one of our internal certificate authorities. For some reason, which does not matter here, the CRL for this specific CA could not be downloaded from the location specified in the server certificate. The server which should have served the CRL sent an HTTP redirect to an invalid URL instead. Before installing KB2919355, this did not matter. Windows Update was still working fine. After installing the update, however, Windows seems to try to download the CRL and to fail the connection if the CRL cannot be downloaded. Obviously, this is a much more secure approach. However, there is no message indicating the cause of the problem in the event log, so the administrator has to find the cause of the problem by trial and error. This is something Microsoft could really improve.
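Since Windows does not log the cause, checking the CRL distribution point by hand is a quick way to rule this out. The following is a minimal sketch using only the Python standard library. The function names are made up for this example, and the check is deliberately shallow: it reports redirects instead of following them and only looks at the first bytes of the body, without parsing or verifying the CRL.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Make urllib report redirects instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def classify_crl_response(status, body):
    """Describe whether an HTTP response plausibly carries a CRL."""
    if 300 <= status < 400:
        return "redirect: check that the target really serves the CRL"
    if status != 200:
        return "error: HTTP status %d" % status
    # A DER-encoded CRL is an ASN.1 SEQUENCE (first byte 0x30);
    # a PEM-encoded one starts with this textual header.
    if body[:1] == b"\x30" or body.lstrip().startswith(b"-----BEGIN X509 CRL-----"):
        return "ok: response looks like a CRL"
    return "error: response body does not look like a CRL"


def check_crl_url(url, timeout=10):
    """Fetch url without following redirects and classify the response."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return classify_crl_response(response.status, response.read())
    except urllib.error.HTTPError as error:
        # Redirects end up here too, because _NoRedirect refuses to follow them.
        return classify_crl_response(error.code, b"")
    except urllib.error.URLError as error:
        return "error: %s" % error.reason
```

Calling check_crl_url with the URL from the certificate's CRL distribution point extension would have flagged the redirect described above right away.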

After fixing the problem with the CRL, so that it could be downloaded correctly, Windows Update worked again without any problems. I am just a bit annoyed because I spent nearly an entire day figuring this out...

Migrating the EJBCA database from H2 to PostgreSQL

I recently installed EJBCA for managing our internal public key infrastructure (PKI). Before using EJBCA, I used openssl from the command-line, but this got uncomfortable, in particular for managing certificate revocation lists (CRLs).

Unfortunately, I made a small but significant mistake when setting up EJBCA: I chose to use the default embedded H2 database. While this database for sure could handle the load for our small PKI, it is inconvenient when trying to make backups: The whole application server needs to be stopped in order to ensure consistency of the backups, a solution which is rather impractical. Therefore I wanted to migrate the EJBCA database from H2 to PostgreSQL.

However, H2 and PostgreSQL are quite different, and the SQL dump generated by H2 could not be easily imported into PostgreSQL. After trying various approaches, I luckily found the nice tool SQuirreL SQL, which (besides other things) can copy tables between databases - even databases of different types. Obviously, this will not solve all migration problems, but for my situation it worked quite well.

I documented the whole migration process in my wiki, in case someone else wants to do the same.

Listing local users with PowerShell in Windows Server 2012 R2

Listing local users with PowerShell is easy: A reference to the local computer object can be created with $computer = [ADSI] "WinNT://." and then user and group accounts can be accessed through the Children property ($computer.Children).

There is only one problem with this: On Windows Server 2012 R2 (and also on Windows 8.1 according to here and here), the process will hang indefinitely when trying to iterate over the children. Even canceling the action with CTRL-C will not work. The only way to get out of this is closing and restarting PowerShell.

Luckily there is a workaround for this bug: When explicitly filtered for user and group objects, the children can be iterated. So before iterating over the Children property, you should configure the schema filter by calling $computer.Children.SchemaFilter.AddRange(@("user", "group")) and everything should work nicely.

PS: I would have posted this solution to the two sites discussing this issue. However, both force you to register before posting anything. In my opinion this is a great way of keeping people from participating.

Bare-Metal Recovery of Windows Server 2012 R2 using Bacula

I have been using Bacula as our main backup system for years. While Bacula works perfectly for Linux systems, bare-metal recovery (also known as disaster recovery) of Windows systems has been an open issue ever since.

The Bacula manual describes some procedures, but they only apply to systems running an operating system not newer than Windows Server 2003 R2. Even these procedures remain a bit unclear. If you look for solutions that cover Windows Server 2008 and newer versions of Windows, you will only find a few mailing-list posts that discuss using Windows Server Backup in combination with Bacula. However, none of these solutions sound very appealing.

Because I believe that you do not have a backup unless you have tested the restore, I wanted to find out the best way of backing up a Windows system with Bacula. So I spent some time and installed a Windows Server 2012 R2 system in a virtual machine, made a backup with Bacula, and then tried to restore this backup in a new virtual machine. I actually succeeded without using Windows Server Backup or any other third-party tool. It really seems to work with a Bacula-only solution.

I documented the steps I used in the wiki, just in case I might have to restore a Windows system from a Bacula backup in the future. Maybe this guide is useful for you as well.

Blinking Cursor but System not Booting on HP ProLiant DL160 G6 Server

Recently I experienced a strange problem with an HP ProLiant DL160 G6 server:

Sometimes after seeing the BIOS initialization messages, the system would not boot but just show a blank screen with a blinking cursor. After power-cycling, this problem sometimes would disappear and sometimes it would appear again.

Frankly, this problem puzzled me. Luckily, someone else had experienced this problem before and found the reason:

This problem can be caused by an incompatible KVM console. In my case I had been using a Sharkoon PS/2 to USB adapter in order to connect the system to an ATEN KVM switch (the server does not have PS/2 and the KVM switch does not have USB). After I connected a USB keyboard directly, the problem disappeared, even if the PS/2 to USB adapter was connected in parallel.

Unfortunately, I have not yet figured out whether the problem is caused by the adapter, the KVM switch, or the PS/2 keyboard. Maybe I will try a different adapter and report whether this fixes the problem.

OpenLDAP Server not listening on IPv6 Socket in Zimbra 8

Recently I have been experiencing a strange problem with an installation of the Community Edition of Zimbra Collaboration Server 8: Although all services were running, no e-mails were delivered. In the log file /var/log/zimbra.log I found messages like "zimbra amavis[9323]: (09323-01) (!!)TROUBLE in process_request: connect_to_ldap: unable to connect at (eval 111) line 152.".

The strange thing about this was that the OpenLDAP daemon (slapd) was running and answering requests. After restarting Zimbra (/etc/init.d/zimbra restart), the problem disappeared, but it reappeared after the next reboot.

After some time I figured out that - right after the reboot - slapd was only listening on an IPv4 socket, not on an IPv6 socket. After restarting the OpenLDAP server (ldap stop && ldap start as user zimbra), the problem disappeared again and netstat showed that slapd was now also listening on the IPv6 socket.

In the end I could not figure out why the OpenLDAP daemon would only listen on IPv4 when started during system boot but would listen on both IPv4 and IPv6 when started later. I suspected some problem with name resolution in the early boot process (although both the IPv4 and the IPv6 address were listed in /etc/hosts).

However, I found a work-around for the problem: By setting the local configuration option ldap_bind_url to ldap:/// (zmlocalconfig -e ldap_bind_url=ldap:///), I could configure OpenLDAP to listen on all local interfaces, which apparently fixed the problem.
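To make the netstat comparison repeatable (for example right after boot and again after restarting slapd), the listening sockets can be extracted from captured netstat output. This is a minimal sketch that assumes the usual column layout of "netstat -tln" on Linux (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and the standard LDAP port 389. The function name is made up for this example.

```python
def listening_protocols(netstat_output, port):
    """Return the protocols ("tcp", "tcp6") with a LISTEN socket on port."""
    protocols = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows have six columns; header lines do not end in LISTEN here.
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue
        proto, local_address = fields[0], fields[3]
        # The port is whatever follows the last colon ("0.0.0.0:389", ":::389").
        if local_address.rsplit(":", 1)[-1] == str(port):
            protocols.add(proto)
    return protocols
```

If the result for port 389 contains only "tcp" right after boot but both "tcp" and "tcp6" after restarting slapd, the machine shows the behaviour described here.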

RTFM or better don't...

While I am writing about curious bugs, here is another one, although technically it is not really a bug.

When setting up Icinga with mod_gearman, I wondered why service checks were running on the assigned mod_gearman worker node, but host checks were running on the main Icinga server and were not distributed using mod_gearman. I checked the configuration again and again, but could not find an error. Searching the web did not turn up much useful information either.

The only thing that I could find were hints that do_hostchecks had to be set to "yes" in /etc/mod-gearman/module.conf. But according to the mod_gearman documentation, this option was set to "yes" by default.

Well, as it turns out, the flag is set to "no" by default, at least in the version of mod_gearman that is available in the software repositories of Ubuntu 12.04 LTS (Precise Pangolin). By the way, the manual that is distributed in the source archive of mod_gearman 1.2.2 (this is the same version that comes with Ubuntu) says the same, so it is not a thing that was changed recently.
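One way to avoid this kind of trap is to check which value is actually in effect and to make the assumed default explicit instead of trusting the documentation. The following is only a sketch: it assumes that module.conf consists of simple key=value lines with # comments, and the default of "no" reflects what was observed with the Ubuntu 12.04 LTS package, not necessarily other versions. The function name is made up for this example.

```python
def effective_do_hostchecks(config_text, default="no"):
    """Return the do_hostchecks value in effect for the given module.conf text.

    default is the value assumed when the option is not set at all
    ("no" as observed with the Ubuntu 12.04 LTS package).
    """
    value = default
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, separator, setting = line.partition("=")
        if separator and key.strip() == "do_hostchecks":
            value = setting.strip().lower()  # the last occurrence wins
    return value
```

Setting do_hostchecks=yes explicitly in the configuration file makes the behaviour independent of whatever default a particular version ships with.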

OpenDKIM bug in Zimbra Collaboration Server

Recently I stumbled across a bug in the OpenDKIM configuration of the Zimbra Collaboration Server.

In ZCS 8.0.3 (Community Edition, but I guess the same applies to the Network Edition), the file /opt/zimbra/conf/opendkim.conf.in specifies the socket that OpenDKIM listens on in the following way:

Socket                %%zimbraInetMode%%:8465@[%%zimbraLocalBindAddress%%]

This results in the socket address "inet6:8465@[::1]" in the final file (opendkim.conf). However, the Postfix configuration file /opt/zimbra/postfix/conf/master.cf.in specifies the socket as "inet:localhost:8465". This leads to Postfix trying to connect to an IPv4 socket while OpenDKIM is listening on an IPv6 socket, so the connection cannot be established.

The fix is quite easy: By changing "%%zimbraInetMode%%:8465@[%%zimbraLocalBindAddress%%]" to "inet:8465@[127.0.0.1]" in opendkim.conf.in and restarting Zimbra, OpenDKIM can be made to listen on an IPv4 socket, so that Postfix can connect again.
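The mismatch can be illustrated with a small sketch that expands the %%name%% placeholders and then looks at the IP version of the resulting address. The placeholder values used below (inet6 and ::1) are inferred from the generated line quoted above, and the two helper functions are made up for this illustration. They are not part of Zimbra.

```python
import ipaddress
import re


def render(template, values):
    """Replace each %%name%% placeholder in template with values[name]."""
    return re.sub(r"%%(\w+)%%", lambda match: values[match.group(1)], template)


def socket_ip_version(socket_spec):
    """Return the IP version (4 or 6) of a 'family:port@[address]' value."""
    address = socket_spec.split("@", 1)[1].strip("[]")
    return ipaddress.ip_address(address).version


template = "%%zimbraInetMode%%:8465@[%%zimbraLocalBindAddress%%]"
generated = render(
    template, {"zimbraInetMode": "inet6", "zimbraLocalBindAddress": "::1"}
)
fixed = "inet:8465@[127.0.0.1]"
```

Here generated is the IPv6 socket that the template produces on the affected system, while fixed is the hard-coded IPv4 socket, which is where Postfix ends up connecting according to the description above.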

The curious thing is that this bug was already reported half a year ago and has supposedly been fixed. However, it seems like this fix was only applied to the 9.0 branch of Zimbra and not to Zimbra 8.0.

Update on KVM Shutdown on Ubuntu

About two years ago, I wrote an article about how to make libvirt on Ubuntu 10.04 LTS shut down the virtual machines gracefully when the host system is shut down or rebooted.

I recently found out that a similar approach has been implemented in Ubuntu 12.04 LTS. The only problem with this is that the default timeout (30 seconds) is too short for virtual machines running complex services. Therefore, I documented how to change this timeout in my wiki.

Less Trouble with KVM virtio and DHCP

In an earlier blog post I claimed that I was seeing problems with VMs using the virtio driver for networking on an Ubuntu 12.04 LTS KVM host using DHCP.

However, as far as my setup is concerned, this claim was wrong. I have now figured out that the messages about bad UDP checksums had nothing to do with my problem. Rather, I was experiencing problems caused by a configuration that did not list the VLAN network interface (eth0.X) on which the DHCP relay agent received the answers from the DHCP server.

The mean thing is that switching away from virtio fixed this problem. However, this was not because the UDP checksums were now right (this was merely a side effect). It fixed the problem because, when not using the virtio driver, the DHCP relay agent would receive the answer packets even if they arrived on a VLAN interface it was not listening on. I can only guess that the implementation for VLAN-tagged interfaces is slightly different when using the virtio driver.

After adding the interface to the list of interfaces used by the DHCP relay agent, the DHCP packets are relayed correctly, even when using the virtio driver. The messages about bad UDP checksums have reappeared in the log file, but obviously this is not causing any trouble.

On the other hand, according to a bug report, some users really seem to have problems with DHCP when using the virtio driver. However, this might only affect Ubuntu 12.04 LTS guests but not VMs on an Ubuntu 12.04 LTS host.

Trouble with KVM virtio and DHCP

Lately I experienced a problem with two KVM-based virtual machines, one running a DHCP server and the other running a DHCP relay (for both I use the ISC implementations). The DHCP relay was complaining about "bad udp checksums". Using tcpdump and wireshark I quickly found out that the software was right and the UDP checksums were in fact wrong. After some searching, I found a bug report that basically described the same problem.
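For reference, this is roughly what such a check involves for UDP over IPv4: the checksum is the 16-bit ones' complement of the ones' complement sum over a pseudo-header (source address, destination address, protocol number, and UDP length) followed by the UDP header and payload. The following minimal sketch is not what the ISC software or wireshark actually runs. It only illustrates the calculation, and the function names are made up for this example.

```python
import socket
import struct


def _ones_complement_sum(data):
    """16-bit ones' complement sum of data (zero-padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def _pseudo_header(src_ip, dst_ip, udp_length):
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length)
    )


def udp_checksum(src_ip, dst_ip, segment):
    """Checksum for segment (UDP header + payload) with a zeroed checksum field."""
    header = _pseudo_header(src_ip, dst_ip, len(segment))
    checksum = ~_ones_complement_sum(header + segment) & 0xFFFF
    return checksum or 0xFFFF  # 0 is reserved for "no checksum computed"


def udp_checksum_ok(src_ip, dst_ip, segment):
    """Check the checksum stored in bytes 6-7 of a UDP segment."""
    (stored,) = struct.unpack("!H", segment[6:8])
    if stored == 0:
        return True  # the sender chose not to compute a checksum
    header = _pseudo_header(src_ip, dst_ip, len(segment))
    # With a valid checksum in place, everything sums up to 0xFFFF.
    return _ones_complement_sum(header + segment) == 0xFFFF
```

A segment extracted from a packet capture can be passed to udp_checksum_ok together with the source and destination addresses from its IP header. Note that checksums often look wrong when captured on the sending machine if checksum offloading is enabled, because they are then filled in after the capture point.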

Although I cannot verify this, I think the problem might be related to the fact that I recently upgraded the host machine from Ubuntu 10.04 LTS (Lucid) to Ubuntu 12.04 LTS (Precise). As a workaround, I deactivated the use of the "virtio" support for the network interface in both virtual machines, which seems to fix the problem, because then the UDP checksums are correct.

However, when I performed the same change for a virtual machine still running on an Ubuntu 10.04 LTS host, this actually caused a problem: If VLAN interfaces are used inside the virtual machine, the normal non-virtio driver will screw things up on an Ubuntu 10.04 LTS host.

Long story short: For virtual machines running on an Ubuntu 10.04 LTS host you should use virtio networking, and for virtual machines running on an Ubuntu 12.04 LTS host you should avoid it.

Update [2012-07-08]: It seems like the conclusions I drew in this article are actually wrong. Therefore I posted an update clarifying the situation.