Non-blocking DatagramChannel and empty UDP packets

I just found out the hard way that there are two bugs when using a non-blocking DatagramChannel in Java with empty (zero payload size) UDP packets.

The first one is not so much a bug as an API limitation: When sending an empty UDP packet, you cannot tell whether it has actually been sent. The send method returns the number of bytes sent and returns zero when the packet has not been sent, but if you send an empty packet, the number of bytes sent is zero even if the send operation succeeded. So there is no way to tell whether the send operation was successful.

The second bug is more serious and this one clearly is a bug in the implementation. When using a DatagramChannel with a Selector, the selector does not return from its select operation when an empty packet has been received. This means that the select call might block forever and you will only see the empty packets received once a non-empty packet arrives.

I describe the two problems and a possible workaround in more detail in my wiki. In the project that I am working on, I can live with not knowing for sure whether the packet was sent (I just try again later if there is no reaction) and for the case where I have to receive packets, I am now using blocking I/O. However, I still think that this is a nasty bug that should be fixed in a future release of Java.

Process tracking in Upstart

Recently, I experienced a problem with the process tracking in Upstart:

I wanted to start a daemon process running as a specific user. However, I needed root privileges in the pre-start script, so I could not use the setuid/setgid options. I tried to use su in the exec option, but then Upstart would not track the right process. The expect fork and expect daemon options did not help either. As a nasty side effect, these options cannot be tested easily, because having the wrong option will lead to Upstart waiting for an already dead process to die and there is no way to reset the status in Upstart. At least, there is a workaround for effectively resetting the status without restarting the whole computer.

The problem is that su forks when running the command it is asked to run instead of calling exec from the main process. Unfortunately, the process I had to run would fork again because I had to run it as a daemon (not running it as a daemon had some undesirable side effects). Finally, I found the solution: Instead of using su, start-stop-daemon can be used. This tool will not fork and therefore it will not upset Upstart's process tracking. For example, the line

exec start-stop-daemon --start --chuid daemonuser --exec /bin/server_cmd
will run /bin/server_cmd as daemonuser without forking.

This way, expect fork or expect daemon can be used, just depending on the fork behavior of the final process.
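Putting the pieces together, a complete job definition might look like the following sketch. The job name myserver, the runlevel conditions, and the commands in the pre-start script are assumptions made for illustration; only the exec line is taken from the example above. The shell function simply writes the job file to the path it is given (on a real system this would be /etc/init/myserver.conf, written as root).

```shell
#!/bin/sh
# Writes a hypothetical Upstart job definition to the path given as $1.
write_upstart_job() {
    cat > "$1" <<'EOF'
description "hypothetical server started via start-stop-daemon"

start on runlevel [2345]
stop on runlevel [!2345]

# Assumes the final process daemonizes itself (forks twice).
# Use "expect fork" instead if it forks only once.
expect daemon

# The pre-start script still runs as root, so privileged setup is possible.
pre-start script
    mkdir -p /var/run/myserver
    chown daemonuser /var/run/myserver
end script

# start-stop-daemon switches the user and runs the command without forking.
exec start-stop-daemon --start --chuid daemonuser --exec /bin/server_cmd
EOF
}
```

For example, write_upstart_job /etc/init/myserver.conf installs the job, after which it can be started with start myserver.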

iTunes Bug: Apps not syncing to iPhone or iPad

While setting up my new iPad yesterday, I experienced a strange problem. iTunes (on Windows) would repeatedly crash with a problem in msvcrt10.dll when trying to copy the apps to the iPad.

In the Apple support forums, I found the explanation and a workaround for this problem: It seems like iTunes 11.4 introduced a bug (that is still present in iTunes 12) that causes a crash when apps are stored on a network share and referenced using a UNC path. In my case, the Music folder (which is the default location for the iTunes library) is redirected to a UNC path pointing to a DFS share. Interestingly, this bug only affects apps, not music or videos.

In order to make the apps sync again, the path that iTunes uses for referencing the files needs to be changed to a regular path with a drive letter. This can be achieved either by copying the apps to a local drive or by mapping the network share to a drive letter. Either way, all apps need to be deleted from the iTunes library (but not deleted on disk) and re-added using the regular path. Obviously, iTunes has to be configured to not automatically copy files to its default library location. After this change, the synchronization should work. Finally, the apps can be deleted again and re-added using the UNC path - once the apps are on the device (with the newest version) iTunes will not try to copy them again, thus avoiding the bug.

However, I find it annoying that this bug has been known since mid-September and still has not been fixed by Apple.

Trouble with IPv6 in a KVM guest running the 3.13 kernel

Some time ago, I wrote about two problems with the 3.13 kernel shipping with Ubuntu 14.04 LTS Trusty Tahr: One turned out to be a problem with KSM on NUMA machines acting as Linux KVM hosts and was fixed in later releases of the 3.13 kernel. The other one affected IPv6 routing between virtual machines on the same host. Finally, I figured out the cause of the second problem and how it can be solved.

I use two different kinds of network setups for Linux KVM hosts: For virtual-machine servers in our own network, the virtual machines get direct bridged access to the network (actually I use OpenVSwitch on the VM hosts for bridging specific VLANs, but this is just a technical detail). For this kind of setup, everything works fine, even when using the 3.13 kernel. However, we also have some VM hosts that are actually not in our own network, but are hosted in various data centers. For these VM hosts, I use a routed network configuration. This means that all traffic coming from and going to the virtual machines is routed by the VM host. On layer 2 (Ethernet), the virtual machines only see the VM host and the hosting provider's router only sees the physical machine.

This kind of setup has two advantages: First, it always works, even if the hosting provider expects to only see a single, well-known MAC address (which might be desirable for security reasons). Second, the VM host can act as a firewall, only allowing specific traffic to and from the outside world. In fact, the VM host can also act as a router between different virtual machines, thus protecting them from each other should one be compromised.

The problems with IPv6 only appear when using this kind of setup, where the Linux KVM host acts as a router, not a bridge. The symptoms are that IPv6 packets between two virtual machines are occasionally dropped, while communication with the VM host and the outside world continues to work fine. This is caused by the neighbor-discovery mechanism in IPv6. From the perspective of the VM host, all virtual machines are in the same network. Therefore, it sends an ICMPv6 redirect message in order to indicate that the VM should contact the other VM directly. However, this does not work because the network setup only allows traffic between the VM host and individual virtual machines, but no traffic between two virtual machines (otherwise it could not act as a firewall). Therefore, the neighbor-discovery mechanism determines the other VM to be not available (it should be on the same network but does not answer). After some time, the entry in the neighbor table (that you can inspect with ip neigh show) will expire and communication will work again for a short time, until the next redirect message is received and the same story starts again.

There are two possible solutions to this: The proper one would be to use an individual interface for each guest on the VM host. In this case, the VM host would not expect the virtual machines to be on the same network and thus stop sending redirect packets. Unfortunately, this makes the setup more complex and - if using a separate /64 for each interface - needs a lot of address space. The simpler albeit sketchy solution is to prevent the redirect messages from having any effect. For IPv4, one could disable the sending of redirect messages through the sysctl option net.ipv4.conf.<interface>.send_redirects. For IPv6, however, this option is not available. So one could either use an ip6tables rule on the OUTPUT chain of the VM host for blocking those packets or simply configure the KVM guests to ignore such packets. I chose the latter approach and added

# IPv6 redirects cause problems because of our routing scheme.
net.ipv6.conf.default.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0

to /etc/sysctl.conf in all affected virtual machines.
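For reference, the other option (blocking the redirects on the VM host instead of ignoring them in the guests) could be sketched as follows. This is not the approach taken here, just an illustration of what such a rule might look like. Because the packets are IPv6, the rule goes into ip6tables. The function name and the IP6TABLES override variable (useful for a dry run) are assumptions made for this example.

```shell
#!/bin/sh
# Command used to manage the IPv6 packet filter. It can be overridden,
# for example with "echo", to print the rule instead of installing it.
IP6TABLES="${IP6TABLES:-ip6tables}"

# Drops ICMPv6 redirect messages (type 137) generated by this host.
# Has to be run as root on the VM host.
block_outgoing_redirects() {
    "$IP6TABLES" -A OUTPUT -p icmpv6 --icmpv6-type redirect -j DROP
}
```

The rule only affects packets generated by the host itself, so regular traffic routed between the virtual machines is not touched by it.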

I do not know why this behavior changed with kernel 3.13. One would expect the same problem to appear with older kernel versions, but I guess there must have been some change in the details of how NDP and redirect messages are handled.

Addendum (2014-11-02):

Adding the suggested options to sysctl.conf does not seem to fix the problem completely. For some reason, an individual network interface can still have this setting enabled. Therefore, I have now added the following line to the IPv6 configuration of the affected interface in /etc/network/interfaces:

        post-up sysctl net.ipv6.conf.$IFACE.accept_redirects=0

This finally fixes it, even if the other options are not added to sysctl.conf.
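To see which interfaces still accept redirects, the per-interface values can be read directly from /proc. The following is a minimal sketch; the function name and the optional directory argument (which only exists to make the function easy to try out on a copy of the files) are made up for illustration.

```shell
#!/bin/sh
# Prints the name of every interface whose accept_redirects flag is not 0.
# $1 optionally overrides the directory that holds the per-interface
# settings (default: the kernel's IPv6 settings under /proc).
list_accepting_redirects() {
    conf_dir="${1:-/proc/sys/net/ipv6/conf}"
    for f in "$conf_dir"/*/accept_redirects; do
        # Skip the unexpanded pattern if nothing matched.
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" != "0" ]; then
            basename "$(dirname "$f")"
        fi
    done
}
```

Calling list_accepting_redirects without arguments in a guest should print nothing once the setting is effective for all interfaces.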

How we treat women in the IT industry

Some time ago I discovered the video series "Tropes vs Women in Video Games" created by Anita Sarkeesian. I found these videos very interesting as they show in an entertaining way how women are depicted in pop culture in general and in video games specifically.

Unfortunately, Anita has faced online harassment since the start of the Kickstarter campaign for Tropes vs Women in Video Games. It seems like some people feel threatened by someone who just wants to expose how the entertainment industry presents women in movies and video games (the "damsel in distress" trope is the most common one, which probably everyone has seen in a movie or game). To make it very clear: Anita does not campaign for any video games to be abolished. She just shows how many (or even most) video games present a distorted image of women. Obviously, the gaming industry itself suffers from this fact, because many gamers (regardless of gender) are annoyed by the lack of strong female characters in most video games. In acknowledgement of her work, Anita has received the 2014 Game Developers Choice Ambassador Award.

Over the last few days, an as yet unknown person harassed Anita on Twitter in an unprecedented way: The person did not just insult her but actually threatened to murder her and her family. The reactions to these threats are nearly as disturbing as the threats themselves: In the discussion boards of Heise Online (German only), many people argue that there is no systematic discrimination against women in the IT industry. However, even if one ignores the current example (and argues that Anita is not part of this industry), women are obviously discriminated against in our industry: I recommend reading an interesting article written by the founder of a Silicon Valley-based startup trying to find investors: Many of those investors are more interested in her than in her business and it is more routine than exception that they hit on her - even when she shows clearly that she is not interested.

We should all reflect on how we treat women in our industry and people like Anita Sarkeesian help us in doing so. Therefore, today I donated some money to her project "Feminist Frequency". I had already planned this for a long time, but the most recent events made sure that I did not wait any longer. When, if not in these troubling times, is the right time to show support? I can only ask everyone in the IT industry to also think about how we treat women and to support those who are courageous and speak up.

Update on KVM problems with kernel 3.13

A few weeks ago, I wrote about problems with kernel 3.13 on Ubuntu 12.04 LTS and 14.04 LTS.

Most likely, the problem that caused the excessive CPU load and occasional high network latency has been fixed by now and the fix is going to be included in version 3.13.0-33 of the kernel package. I experienced this problem on a multi-processor machine, so it is probable that this was the problem with KSM and NUMA that has been fixed.

I am not sure whether the problems that I had with IPv6 connectivity are also solved by this fix: I had experienced those problems on a single-processor (but multi-core) machine, so it does not sound like a NUMA problem to me.

Anyhow, I will give the 3.13 kernel another try when the updated version is released. For the moment, I have migrated all server machines back to the 3.2 kernel, because the 3.5 kernel will reach its end of life soon and the 3.13 kernel is not ready for production use yet. I do not expect considerable gains from using a newer kernel version on the servers anyway, so for the moment, the 3.2 kernel is a good option.

Expensive but badly translated

You buy a piece of software for 500 EUR and then have to look at something like this:

Translation error in Visio 2013

It is clear how such errors happen: Lists of terms are handed to translation agencies, which then translate them without knowing the context. However, errors this severe should be caught by quality assurance.

Of course, one could now simply switch the GUI language to English and thus avoid such silly errors. Microsoft, however, is of the opinion that one language is enough and that you should purchase an additional language pack if you want to switch the user interface to English. I think that this is rather shameful, in particular when considering that Visio Professional is one of the most expensive products in the Microsoft Office family.

Linux KVM Problems with Ubuntu 14.04 LTS / Kernel 3.13.0-30

A few days ago I upgraded a virtual-machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). First, I thought that everything was working fine.

However, a short time later I noticed funny problems with the network connectivity, particularly (but not only) affecting Windows guests. Occasionally, ICMP echo requests would only be answered with an enormous delay (seconds) or sometimes not even be answered at all. TCP connections to guests would stall very often. At the same time the load on the host system would be high even though the CPU usage would not be extremely heavy.

After I downgraded the virtual-machine host back to Ubuntu 12.04 LTS (and consequently to kernel 3.5) these problems disappeared immediately.

It seems like this is a bug related to the 3.13 kernel shipped with Ubuntu 14.04 LTS. There is a bug report on Launchpad and a discussion on Server Fault. It might be that the other problems that I experienced with the backported 3.13 kernel are related to this issue.

For the moment I will keep our virtual-machine hosts on Ubuntu 12.04 LTS and kernel 3.5, until the problems with the 3.13 kernel have been sorted out.

Trouble after installing linux-generic-lts-trusty in Ubuntu 12.04 LTS

Yesterday I updated a lot of computers (hosts as well as virtual machines) running Ubuntu 12.04 LTS (Precise Pangolin) to the backported version of the 3.13 kernel. This kernel is provided by the linux-image-generic-lts-trusty package which is installed (together with the linux-headers-generic-lts-trusty package) when installing linux-generic-lts-trusty. By installing the backported kernel (before the update all Ubuntu 12.04 LTS systems were running on the 3.5 kernel provided by linux-generic-lts-quantal) I wanted to increase the uniformity between the Ubuntu 12.04 LTS and Ubuntu 14.04 LTS systems.

After installing the new kernel and rebooting the machines, funny network problems started to happen. For some virtual machines, IPv6 communication between virtual machines running on the same VM host became very unreliable. For other virtual machines, I experienced occasional huge delays (up to several seconds) for IPv4 packets.

After testing around for a few hours (at the same time I had upgraded a virtual-machine host to Ubuntu 14.04 LTS and first suspected this upgrade, specifically the new version of OpenVSwitch), I found out that these network problems were indeed caused by the new kernel in the virtual machines. If one of two virtual machines running on the same host had the new kernel running, the problems with IPv6 appeared. If both were running the old kernel version, the problems disappeared. The other problem with the massively delayed IPv4 packets was a bit harder to reproduce. Funnily, it already became much better when I downgraded just one of the virtual machines on the host.

At the current stage (linux-image-generic-lts-quantal-3.13.0-30), there seems to be a massive problem with the IP stack of the kernel. For some reason, these problems only seem to be triggered if the kernel is running in a (Linux KVM) virtual machine. For now, I downgraded all virtual machines back to the old kernel version.

I have to do some more tests to find out whether these problems are caused by the newer kernel in general or whether they are specifically caused by the backported version. At the moment I only have one virtual machine with Ubuntu 14.04 LTS, so I will have to set up some test VMs to carry out more tests.

Until then, I can only recommend staying away from the backported 3.13 kernel, at least for virtual machines.

Nagios check_linux_raid missing in Ubuntu 14.04 LTS

I just upgraded a KVM virtual machine host from Ubuntu 12.04 LTS (Precise Pangolin) to Ubuntu 14.04 LTS (Trusty Tahr). Everything went smoothly except for one problem: The check_linux_raid script is missing in the updated version of the nagios-plugins-standard package.

The nagios-plugins-contrib package seems to contain a script which basically does the same job, but this package has a lot of other plugins that pull tons of additional dependencies, so I did not want to install this package. Luckily, just copying the check_linux_raid script from a system with the older version of Ubuntu worked fine for me.

Windows Server 2012 R2 Windows Update Error 80072F8F

After installing the KB2919355 update, Windows Update would always present error 80072F8F when checking for updates.

Now, one might assume that this is the problem with WSUS 3.x that you can read about everywhere. However, in my case it was not. The WSUS server was using SSL/TLS, but it was running on Windows Server 2012 R2 as well. I looked into this problem for many hours and was misled by two things: First, if you search for this problem on the Internet, there are so many articles talking about the well-known problem with old WSUS servers that you hardly find anything else. Second, I also could not open the WSUS site (or any other SSL-enabled site) in Internet Explorer on the affected machines.

I still do not know where the problem with Internet Explorer comes from - it might well have existed from the beginning and uninstalling KB2919355 did not fix it. The problem with WSUS however was indeed caused by KB2919355...

The certificate used by the WSUS site is signed by one of our internal certificate authorities. For some reason, which does not matter here, the CRL for this specific CA could not be downloaded from the location specified in the server certificate. The server which should have served the CRL sent an HTTP redirect to an invalid URL instead. Before installing KB2919355, this did not matter. Windows Update was still working fine. After installing the update, however, Windows seems to try to download the CRL and fails the connection if the CRL cannot be downloaded. Obviously, this is a much more secure approach. However, there is no message indicating the cause of the problem in the event log, so the administrator has to find the cause of the problem by trial and error. This is something Microsoft could really improve.

After fixing the problem with the CRL, so that it could be downloaded correctly, Windows Update worked again without any problems. I am just a bit annoyed because I spent nearly an entire day figuring this out...
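One way to check for this kind of problem from a Linux machine is to read the CRL distribution point from the server certificate and try to download the CRL by hand. The sketch below is not what I used at the time; it assumes OpenSSL 1.1.1 or newer (for the -ext option), curl, a PEM copy of the server certificate, and a DER-encoded CRL. The function names are hypothetical.

```shell
#!/bin/sh
# Prints the CRL distribution point URLs listed in the PEM certificate $1.
crl_urls() {
    openssl x509 -in "$1" -noout -ext crlDistributionPoints |
        sed -n 's/^ *URI:\(.*\)$/\1/p'
}

# Downloads the CRL from URL $1 to file $2 and prints its validity dates.
# Fails if the download fails, for example because the server answers
# with a redirect to an invalid URL.
check_crl() {
    curl -fsSL -o "$2" "$1" &&
        openssl crl -inform DER -in "$2" -noout -lastupdate -nextupdate
}
```

Running crl_urls on the certificate and then check_crl on each URL it prints shows quickly whether the CRL can actually be retrieved.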

Migrating the EJBCA database from H2 to PostgreSQL

I recently installed EJBCA for managing our internal public key infrastructure (PKI). Before using EJBCA, I used openssl from the command-line, but this got uncomfortable, in particular for managing certificate revocation lists (CRLs).

Unfortunately, I made a small but significant mistake when setting up EJBCA: I chose to use the default embedded H2 database. While this database for sure could handle the load for our small PKI, it is inconvenient when trying to make backups: The whole application server needs to be stopped in order to ensure consistency of the backups, a solution which is rather impractical. Therefore I wanted to migrate the EJBCA database from H2 to PostgreSQL.

However, H2 and PostgreSQL are quite different, and the SQL dump generated by H2 could not be easily imported into PostgreSQL. After trying various approaches, I luckily found the nice tool SQuirreL SQL, which (besides other things) can copy tables between databases - even databases of different types. Obviously, this will not solve all migration problems, but for my situation it worked quite well.

I documented the whole migration process in my wiki, in case someone else wants to do the same.

Listing local users with PowerShell in Windows Server 2012 R2

Listing local users with PowerShell is easy: A reference to the local computer object can be created with $computer = [ADSI] "WinNT://." and then user and group accounts can be accessed through the Children property ($computer.Children).

There is only one problem with this: On Windows Server 2012 R2 (and also on Windows 8.1 according to here and here), the process will hang indefinitely when trying to iterate over the children. Even canceling the action with CTRL-C will not work. The only way to get out of this is closing and restarting PowerShell.

Luckily there is a workaround for this bug: When explicitly filtered for user and group objects, the children can be iterated. So before iterating over the Children property, you should configure the schema filter by calling $computer.Children.SchemaFilter.AddRange(@("user", "group")) and everything should work nicely.

PS: I would have posted this solution to the two sites discussing this issue. However, both force you to register before posting anything. In my opinion this is a great way of keeping people from participating.

Bare-Metal Recovery of Windows Server 2012 R2 using Bacula

I have been using Bacula as our main backup system for years. While Bacula works perfectly for Linux systems, bare-metal recovery (also known as disaster recovery) of Windows systems has been an open issue ever since.

The Bacula manual describes some procedures, but they only apply to systems running an operating system not newer than Windows Server 2003 R2. Even these procedures remain a bit unclear. If you look for solutions that cover Windows Server 2008 and newer versions of Windows, you will only find a few mailing-list posts that discuss using Windows Server Backup in combination with Bacula. However, none of these solutions sound very appealing.

Because I believe that you do not have a backup unless you have tested the restore, I wanted to find out the best way of backing up a Windows system with Bacula. So I spent some time and installed a Windows Server 2012 R2 system in a virtual machine, made a backup with Bacula, and then tried to restore this backup in a new virtual machine. I actually succeeded without using Windows Server Backup or any other third-party tool. It really seems to work with a Bacula-only solution.

I documented the steps I used in the wiki, just in case I might have to restore a Windows System from a Bacula backup in the future. Maybe this guide is useful for you as well.

Blinking Cursor but System not Booting on HP ProLiant DL160 G6 Server

Recently I experienced a strange problem with an HP ProLiant DL160 G6 server:

Sometimes after seeing the BIOS initialization messages, the system would not boot but just show a blank screen with a blinking cursor. After power-cycling, this problem sometimes would disappear and sometimes it would appear again.

Frankly this problem puzzled me. Luckily, someone else had experienced this problem before and found the reason:

This problem can be caused by an incompatible KVM console. In my case I had been using a Sharkoon PS/2 to USB adapter in order to connect the system to an ATEN KVM switch (the server does not have PS/2 and the KVM switch does not have USB). After I connected a USB keyboard directly, the problem disappeared, even if the PS/2 to USB adapter was connected in parallel.

Unfortunately, I have not yet figured out whether the problem is caused by the adapter, the KVM switch, or the PS/2 keyboard. Maybe I will try a different adapter and report whether this fixes the problem.