Open-Source | Entries from Sunday, November 27. 2016

Design flaws of the Linux page cache

November '16

Posted by Sebastian Marsching on Sunday, November 27. 2016

Like most modern operating systems, Linux has a page cache. This means that when reading from or writing to a file, the data is actually cached in the system’s memory, so that subsequent read requests can be served much faster, without having to access the non-volatile storage.

This is an extremely useful facility, which has a huge impact on system performance. For typical workloads, a significant amount of data is used repeatedly and caching this data reduces the latency dramatically. For example, imagine that every time you ran “ls”, the executable would actually be read from the hard disk. This would undoubtedly make the interactive experience much less responsive.

Unfortunately, there is a downside to how Linux caches file-system data: As long as there is free system memory, the caching is a no-brainer and simply caching everything works perfectly. After some time however, the system memory will be used completely and the kernel has to decide how to free memory in order to cache I/O data. Basically, Linux will evict old data from the cache and even move process data to the swap area in order to make space for data newly arriving in the cache. In most cases, this makes sense: If data has not been used for some time (even data that is not from a disk but belongs to a process which simply has not needed it in some time), it makes sense to evict it from the cache or put it into the swap area so that the fast memory can be used for fresh data that is more likely to be needed again soon. Most times, these heuristics work great, but there is one big exception: Processes that read or write a lot of data, but only do so once.

For example, imagine a backup process. A backup process will read nearly all data stored on disk (and maybe write it to an archive file), but caching the data from this read or write operations does not make a lot of sense because it is not very likely to be accessed again soon (at least not more likely than if the backup process did not run at all). However, the Linux page cache will still cache that data and move other data out of memory. Now, accessing that data again will result in slow read requests from the hardware, effectively slowing the system down, sometimes to a point that is worse than if the page cache had been disabled completely.

For a long time, I never thought about this problem. In retrospect, systems sometimes seemed slow after running a backup, but I never connected the dots until recently I started to see a problem that looked weird at first: Every night, I got an e-mail from the Icinga monitoring system warning me about the swap space on two (very similar) systems running extremely low. First, I expected that some nightly maintenance problem might simply need a lot of memory (I recently had installed a software upgrade on these systems), so I assigned more memory to the virtual machines. I expected that the amount of free swap space would increase by the amount of extra memory assigned, but it did not. The swap space was still used almost completely. Therefore, I looked at the memory consumption while the problem was present, and the result was astonishing: Both memory and swap were virtually fully used, but about 90 percent of the memory was actually used by the page cache, not by any process.

Investigating the problem closer revelealed that the high usage of swap space always occurred shortly after the nightly backup process started running. To be absolutely sure, I manually started the process during the day and I could immediately see the page cache and swap usage growing rapidly. It was clear that the backup process reading a lot of data was responsible for the problem. If you think about it, this is not the kernel’s fault: It simply cannot know that the data will not be needed again and moving other data out of the way is actually counter productive.

So everything that is needed to fix this kind of problem is a way to tell the kernel “please do not cache the data that I read, I will not need it again anyway”. While this sounds simple, it actually is a huge problem with current kernel releases.

There is no way to tell the kernel “do not cache the data accessed by this process”. In fact, there are only four ways for influencing the caching behavior that I am aware of:

The page cache can be disabled globally for the time of running the backup process. While it might alleviate the impact of the problem, it is still not desirable because the page cache actually is useful and disabling it will have a negative impact on system performance. However, during the backup process, disabling the page cache might still result in better performance than not doing anything at all.
Some suggest mounting the file-system with the “sync” flag. However, this will only circumvent the page cache for write requests, not for read requests. It also has very negative impacts on performance, so I only list it here because it is suggested by some people.
The files can be opened with the O_DIRECT flag. This tells the kernel that I/O should bypass the caches. Unfortunately, using O_DIRECT has a lot of side effects. In particular, the size and offset of the memory buffer used for I/O operations has to match certain alignment restrictions that depend on the Kernel version used and the file-system type. There is no API for querying these restrictions, so one can only choose alignment to a rather large size and hope that it is sufficient. Of couse, this also means that simply modifying the code of an application that opens a file is not sufficient. Every piece of code that reads from or writes to a file has to be changed so that it matches the alignment restrictions. This definitely is not an easy task, so using O_DIRECT is more of a theoretical approach.
Last but not least, there is the posix_fadvise function. This function allows a program in user space to tell the kernel how it is going to use a file (e.g. read it sequentially), so that the kernel can optimize things like the file-system cache. There are two flags that can be passed to this function which sound particularly promising: According to the man page, the POSIX_FADV_NOREUSE flag specifies that “the specified data will be accessed only once” and the POSIX_FADV_DONTNEED flag specifies that “the specified data will not be accessed in the near future”.

So it sounds like posix_fadvise can solve the problem and all we need to do is find a way to tell the program (tar in my case) to call it. As it turns out, there even is a small tool called “nocache” that acts as a wrapper around arbitrary programs and uses the LD_PRELOAD mechanism to catch attempts to open a file and adds POSIX_FADV_NOREUSE right after opening the file.

This would be the perfect solution if the Linux kernel actually cared about POSIX_FADV_NOREUSE. Unfortunately, till this day, it simply ignores this flag. In 2011, there was a patch that tried to add support for this flag to the kernel, but it never made it into the mainline kernel (the reasons are unknown to me).

Actually, the nocache tool is aware of this and adds a workaround: When closing the file, it calls posix_fadvise with the POSIX_FADV_DONTNEED flag. It has to do this when closing the file (instead of when opening it) because POSIX_FADV_DONTNEED only removes data from the page cache, it does not prevent it from being added to the page cache.

So I installed nocache, and wrapped the call to tar with it, expecting that this would finally solve the problem. Surprisingly, it did not. At first, it seemed like the page cache was filling less rapidly, but after a while, everything looked quite like before. I tried to add the “-n” parameter to nocache, telling it to call posix_fadvise multiple times (the documentation suggested to do so), but this did not help either.

After giving this some more thought, I realized why: Using POSIX_FADV_DONTNEED when closing the file works great when working with many small or medium-sized files. For large files, however, it does not work. By the time it is called, all of the file has already been put into the page cache, causing the same problems as when many small files are read without nocache. This means that posix_fadvise has to be called repeatedly while reading from the file to ensure that the amount of data cached never grows too large. While there are only a few API calls for opening files, there are ways for reading from a file (e.g. memory-mapped I/O) that nocache simply cannot catch. This means that the only solution is actually patching the program that is reading or writing data.

This is why I created a patch for GNU tar version 1.29. This patch calls posix_fadvise after reading each block of data and thus ensures that the page cache is never polluted by tar. Unfortunately, this patch is not portable, nor does it provide a command-line argument for enabling or disabling this behavior, so it is not really suitable for inclusion into the general source code of GNU tar. The patch only takes care of not polluting the file-system cache when reading files. It does not do so when writing them, but for me this is sufficient because in my backup script, tar writes to the standard output anyway.

When using this patched version of tar, the memory problems disappear completely. This makes me confident that the change is sufficient, so I am going to use this patched version of tar for all systems that I backup with tar. Unfortunately, I use Bareos for the backup of most systems, so I will have to find a solution for that software, too. Maybe we are lucky, and support for POSIX_FADV_NOREUSE will finally be added to Linux at some point in the future, but until then patching the software involved in the backup process seems like the only feasible way to go.