The file system cache (buffer cache) helps programs get to their data
blocks faster by keeping recently used file blocks in memory. If you copy a
large file tree, this has a devastating effect on the cache, since all the
copied data also ends up in the cache and forces other data blocks out.
This is very bad for system performance, since all the other processes on
the system that had their data blocks in the cache before the copying
started suddenly have to read their data from disk again. Using
posix_fadvise you can hint to the OS that it should drop certain file blocks
from the cache. Together with information from mincore, which tells us which
blocks are currently cached, we can alter applications to work without
disturbing the buffer cache. This article shows how this works, using rsync
as an example.

posix_fadvise function

The posix_fadvise function allows you to give the OS advice regarding your
expected use of the data associated with an open file handle. The calling
convention looks like this:
#include <fcntl.h>
int posix_fadvise( int fd, off_t offset, off_t len, int advice );
int posix_fadvise64( int fd, off64_t offset, off64_t len, int advice );
The offset gives the start of the area you are giving advice on. The len is the length of the area. If len is zero, all bytes starting from offset will be affected by the call. The advice parameter specifies the type of advice.

The advice we are interested in here is called POSIX_FADV_DONTNEED. It
tells the OS that we will not be needing the specified bytes again. The
effect of this is that the bytes will be released from the file system
cache. The following mini program tells the OS to release all data
associated with a particular file from the cache.
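#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
    int fd;
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    fdatasync(fd);                                 /* commit dirty data first */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* len 0 = whole file */
    close(fd);
    return 0;
}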
As you can see, we call fdatasync right before calling posix_fadvise. This
makes sure that all data associated with the file handle has been committed
to disk. This is not done because there is any danger of losing data, but
to make sure that the posix_fadvise call has an effect. Since the
posix_fadvise function is advisory, the OS will simply ignore it if it
cannot comply. At least on Linux, a call to
posix_fadvise(fd,0,0,POSIX_FADV_DONTNEED) acts immediately. This means that
if you write a file and call posix_fadvise right after writing a chunk of
data, it will probably have no effect at all, since the data in question
has not been committed to disk yet and therefore cannot be released from
the cache.
As of this writing (kernel 2.6.21), Linux does not remember
POSIX_FADV_DONTNEED advice for an open file. It acts when the advice is
given, and when it cannot comply it forgets the advice. So it is up to you
to make sure Linux can comply.
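
The ordering described above (write, commit, then advise) can be sketched as a small helper. This is only an illustration under stated assumptions: the name write_and_drop and its chunked interface are hypothetical and not taken from any existing program. The return value of posix_fadvise is deliberately ignored because the call is advisory.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes from `buf` to `fd` in chunks of at
 * most `chunk` bytes. Each chunk is committed to disk and then released from
 * the file system cache. Returns 0 on success, -1 on error. */
int write_and_drop(int fd, const char *buf, size_t len, size_t chunk)
{
    off_t start = lseek(fd, 0, SEEK_CUR);   /* file offset where we begin */
    size_t done = 0;

    if (start < 0 || chunk == 0)
        return -1;

    while (done < len) {
        size_t want = (len - done < chunk) ? len - done : chunk;
        ssize_t n = write(fd, buf + done, want);

        if (n <= 0)
            return -1;
        /* Commit first: pages that are still dirty cannot be released. */
        if (fdatasync(fd) != 0)
            return -1;
        /* Advice only, so a failure here is not treated as an error. */
        (void)posix_fadvise(fd, start + (off_t)done, (off_t)n,
                            POSIX_FADV_DONTNEED);
        done += (size_t)n;
    }
    return 0;
}
```

Syncing after every chunk is the simplest way to guarantee that the advice can be honoured, but it is also the slowest. A larger chunk size reduces the number of syncs at the price of more data sitting in the cache at any one time.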

mincore function

Being able to tell the OS to drop a file from the cache is nice, but if the file was cached before our program touched it, we should not drop it, since it was cached for a reason: most likely some other application is using it.

Since the whole point of the exercise is to NOT disturb the file system cache, we need a way to figure out which blocks of a file are present in the cache before we touch the file.
The mincore function tells us just this. Its usage is a bit complicated,
since it works on memory in general and not only on files.
int mincore(void *memory_pointer, size_t file_length, unsigned char *vec);
The first step is to memory map the file. Then call mincore on the memory
pointer to get information about the cache state of each block in the file.
Here is a small example program that lists which blocks of a file are in the cache.
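#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
    int fd;
    struct stat file_stat;
    void *file_mmap;
    unsigned char *mincore_vec;
    size_t page_size = getpagesize();
    size_t page_count;
    size_t page_index;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &file_stat) != 0) {
        perror(argv[1]);
        return 1;
    }
    /* PROT_NONE is enough: the mapping is only queried, never accessed */
    file_mmap = mmap((void *)0, file_stat.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (file_mmap == MAP_FAILED) {    /* also happens for empty files */
        perror("mmap");
        return 1;
    }
    /* mincore fills in one byte per page, so round the size up to whole pages */
    page_count = (file_stat.st_size + page_size - 1) / page_size;
    mincore_vec = calloc(1, page_count);
    if (mincore_vec == NULL || mincore(file_mmap, file_stat.st_size, mincore_vec) != 0) {
        perror("mincore");
        return 1;
    }
    printf("Cached Blocks of %s: ", argv[1]);
    for (page_index = 0; page_index < page_count; page_index++) {
        if (mincore_vec[page_index] & 1) {    /* bit 0 set: page is in memory */
            printf("%lu ", (unsigned long)page_index);
        }
    }
    printf("\n");
    free(mincore_vec);
    munmap(file_mmap, file_stat.st_size);
    close(fd);
    return 0;
}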

posix_fadvise and mincore

I use rsync with its hard-link feature for snapshot-like backups. In that context it is very bad when the backup process evicts data from the file system cache, because it reduces the performance of the other programs accessing the file system. Given the information from the previous sections, it was quite simple to implement a patch for rsync that drops data from the cache after each read or write operation. The resulting version of rsync has virtually no impact on the file system cache contents.
Calling fdatasync before closing a file, as in the example above, is quite
expensive, especially when dealing with small files. Therefore the patch
introduces a file-handle cache where the files only get synced after some
time has passed. This gives the kernel a chance to write data to disk at its
own pace and thus reduces the performance hit we take from syncing.
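
To illustrate the idea of deferring the sync, here is a minimal sketch of such a file-handle cache. It is not the code from the rsync patch: the names pending_fds, pending_close and pending_flush_all are hypothetical, and the sketch flushes the oldest handle when a fixed-size queue is full, whereas the patch is described above as waiting for some time to pass.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define PENDING_MAX 8   /* assumed queue depth, chosen for illustration */

/* Hypothetical queue of descriptors waiting to be synced and dropped. */
struct pending_fds {
    int fds[PENDING_MAX];
    int count;
};

/* Sync, drop and close the oldest queued descriptor, if there is one. */
static void pending_flush_one(struct pending_fds *q)
{
    int fd, i;

    if (q->count == 0)
        return;
    fd = q->fds[0];
    fdatasync(fd);   /* cheap if the kernel already wrote the data back */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    for (i = 1; i < q->count; i++)   /* shift the remaining entries down */
        q->fds[i - 1] = q->fds[i];
    q->count--;
}

/* Call instead of close(): the expensive sync is postponed until the
 * queue is full and this descriptor has become the oldest entry. */
void pending_close(struct pending_fds *q, int fd)
{
    if (q->count == PENDING_MAX)
        pending_flush_one(q);
    q->fds[q->count++] = fd;
}

/* Call before exiting to process everything that is still queued. */
void pending_flush_all(struct pending_fds *q)
{
    while (q->count > 0)
        pending_flush_one(q);
}
```

A program would initialise the queue with count set to zero, replace its close calls with pending_close, and call pending_flush_all once at the end. The queue depth bounds the number of descriptors that are kept open.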
The goal of this patch is for rsync to disturb the file system cache as little as possible. Actively dropping data from the cache when it is not used anymore helps, but it can also be counterproductive if the data had been in the cache before rsync even ran. In that case the data should not be touched.
So before rsync reads anything from a file it asks the kernel which pages of the file it already has in the cache. It will then only drop the pages that had not been in the cache before.
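
The two steps (snapshot the cache state, then drop only what was newly read) can be sketched as follows. This is an illustration, not the code from the rsync patch: cache_snapshot and read_preserving_cache are hypothetical names, and the sketch drops pages one at a time for simplicity. Note that newer Linux kernels may report every page as resident for files the caller is not allowed to write, in which case the sketch simply drops nothing.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: record which pages of `fd` are currently cached.
 * Returns a malloc'ed vector with one byte per page (bit 0 set = resident)
 * and stores the page count in *pages. Returns NULL on error or for an
 * empty file. */
unsigned char *cache_snapshot(int fd, size_t *pages)
{
    struct stat st;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *vec;
    void *map;

    if (fstat(fd, &st) != 0 || st.st_size == 0)
        return NULL;
    *pages = ((size_t)st.st_size + page - 1) / page;
    map = mmap(NULL, (size_t)st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return NULL;
    vec = calloc(*pages, 1);
    if (vec != NULL && mincore(map, (size_t)st.st_size, vec) != 0) {
        free(vec);
        vec = NULL;
    }
    munmap(map, (size_t)st.st_size);
    return vec;
}

/* Hypothetical helper: read the whole file, then drop only the pages that
 * were not resident before the read. Returns bytes read, or -1 on error. */
long read_preserving_cache(const char *path)
{
    char buf[65536];
    size_t pages = 0, i, page = (size_t)sysconf(_SC_PAGESIZE);
    long total = 0;
    ssize_t n;
    unsigned char *before;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    before = cache_snapshot(fd, &pages);   /* NULL means: drop nothing */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;                        /* a real program would use buf */
    for (i = 0; before != NULL && i < pages; i++) {
        if (!(before[i] & 1))              /* this page was brought in by us */
            posix_fadvise(fd, (off_t)(i * page), (off_t)page,
                          POSIX_FADV_DONTNEED);
    }
    free(before);
    close(fd);
    return n < 0 ? -1 : total;
}
```

Merging adjacent pages into a single posix_fadvise call would save system calls on large files. The per-page loop is kept here because it mirrors the description most directly.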
The new rsync functionality has been contributed to the rsync mainline and will appear in the patch directory of the rsync source archive along with the next version of rsync. The original patch for rsync 2.6.9 is available from here.
To see the amount of file system cache currently in use, run
> grep ^Cached: /proc/meminfo
To see the effect of a large write operation, use dd to generate a 67 MB file filled with zeros.
> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
1024+0 records in
1024+0 records out
67108864 bytes transferred in 0.753085 seconds (89111922 bytes/sec)
Now check the cache usage, remove the file and check the cache again.
> grep ^Cached: /proc/meminfo
Cached: 742340 kB
> rm largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 676792 kB
The difference in cache usage matches the file size quite closely. This
indicates that the whole file had been in the file system cache. This is also
the reason for the rather impressive transfer rate dd has reported.
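
If you want to take these readings from a program instead of running grep by hand, a minimal sketch could look like this. The function names cached_kb_from and cached_kb are hypothetical, and the code assumes the "Cached: <number> kB" line format shown in the output above.

```c
#include <stdio.h>

/* Hypothetical helper: return the "Cached:" value in kB from a stream in
 * /proc/meminfo format, or -1 if no such line is found. */
long cached_kb_from(FILE *f)
{
    char line[256];
    long kb;

    while (fgets(line, sizeof line, f) != NULL) {
        /* sscanf matches from the start of the line, so "SwapCached:" is
         * skipped, just like with the ^ anchor in the grep pattern. */
        if (sscanf(line, "Cached: %ld kB", &kb) == 1)
            return kb;
    }
    return -1;
}

/* Convenience wrapper that reads the value from the running system. */
long cached_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    long kb;

    if (f == NULL)
        return -1;
    kb = cached_kb_from(f);
    fclose(f);
    return kb;
}
```

For the numbers shown above, the difference is 742340 - 676792 = 65548 kB, while the file occupies 67108864 / 1024 = 65536 kB. The remaining 12 kB are other cache activity on the system.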
Now let's see what happens when running rsync on a large file.
> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 742340 kB
> rsync largefile.tmp largefile2.tmp
> grep ^Cached: /proc/meminfo
Cached: 807876 kB
> rm largefile.tmp largefile2.tmp
Again the whole file landed in the cache. Twice, actually: once when it was
written by dd and a second time when it was copied by rsync. So finally,
let's do the same thing using the new rsync cache dropping feature.
> dd if=/dev/zero bs=64k count=1024 of=largefile.tmp
> grep ^Cached: /proc/meminfo
Cached: 741940 kB
> rsync --drop-cache largefile.tmp largefile2.tmp
> grep ^Cached: /proc/meminfo
Cached: 741940 kB
This time, the cache usage does not change at all. This is because rsync now
keeps largefile.tmp in cache since it was already in the cache when rsync
was started, and it releases largefile2.tmp since it is new and has not
been in the cache before. If largefile.tmp had not been in cache when
rsync was started, then largefile.tmp would not be in cache after the
rsync run either.