Introduction

Recently I’ve been working on a lock-free message queue implemented using shared memory. Some of the criteria I’ve set for it are that the queue…

  • … must be lock-free,
  • must allow multiple producers, but a single consumer will suffice,
  • should grow on demand, like C++’s std::vector,
  • should add very little overhead when notifying receiving processes about new messages,
  • should not require threads,
  • and writers should know when there is no reader on the other end.

The last requirement is the subject of this article.

Before a process writes to the queue, it first checks whether the reader process is alive. If it is not, the API function shall set errno to EOWNERDEAD and return -1. The reader process might die sometime between the check and the actual write to the queue, but this is not important. What’s important is that the writer should not be able to fill the queue with junk and expect a response from a dead process.

I might end up writing about the message queue some time later if I finish it, but until then I shall write about what I consider to be one of the more entertaining aspects of the project. Update: I’m not finishing it.

I’ve considered and tried out a few implementations. Let’s look at some of them in closer detail.

Performance numbers in this article are based on benchmarks run on my Thinkpad X220 with an Intel i5-2520M, running Linux 4.19.12. The benchmark measures the time it takes to run 10000000 loops that increment a counter and call a function that carries out the action that is to be measured.

kill(pid, 0)

Calling kill(pid, sig) with sig = 0 delivers no signal to the process, but still performs the usual error checking, so it returns an error if the process does not exist.
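
A minimal sketch of this check (the function name is mine, not from any real API) might look like this:

#include <sys/types.h>
#include <signal.h>
#include <errno.h>

/* Returns 1 if the process appears to exist, 0 otherwise. */
static int is_alive_kill(pid_t pid)
{
	/* kill() with signal 0 performs the checks but delivers nothing.
	 * EPERM means the process exists but we may not signal it. */
	return kill(pid, 0) == 0 || errno != ESRCH;
}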

This is slow. I tried it and the throughput of the queue dropped by about a half.

It’s also racy, which is worse than being slow. The problem is that the pid might be reused, which means that the monitoring process won’t know that the monitored process went away. If you’re thinking that this is unlikely and we shouldn’t worry about it, you probably have not experienced Murphy’s Law in action.

open(“/proc/pid”, …)

To fix the race condition in kill(), one can use the following procedure (sketched below):

  • The process that is to be monitored opens its /proc/self directory,
  • and sends it to the monitoring process via a Unix socket.
  • The monitor receives the file descriptor and stores it.
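
Assuming a connected AF_UNIX socket sock, the first two steps might be sketched roughly like this (the function name and error handling are mine):

#include <sys/socket.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Open /proc/self and pass the directory fd over a connected Unix socket. */
static int send_proc_self_fd(int sock)
{
	int fd = open("/proc/self", O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

	char cmsg_buf[CMSG_SPACE(sizeof(int))] = { 0 };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cmsg_buf,
		.msg_controllen = sizeof(cmsg_buf),
	};

	struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	int rc = sendmsg(sock, &msg, 0);
	close(fd);

	return rc < 0 ? -1 : 0;
}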

You can think of this file descriptor as a resource handle for the process’s aliveness. It remains valid as long as there is some process referencing it. A directory with the same name can reappear, but it will not be the same directory as far as the fd is concerned.

Now, when the monitor wants to check if the process is alive, it can try to access any of the files in the /proc/pid directory that was sent to it by the process. The simplest way is probably faccessat(fd, "stat", R_OK, 0). If the process has died, the directory will be empty, so faccessat() will return an error.
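
On the monitoring side, assuming proc_fd is the directory descriptor received over the socket, the check could be sketched as:

#include <unistd.h>

/* Monitor side: proc_fd is the /proc directory fd received from the
 * monitored process. Returns 1 if it is still alive, 0 otherwise. */
static int is_alive_procfs(int proc_fd)
{
	/* When the process dies, its /proc directory becomes empty, so any
	 * lookup inside it fails. */
	return faccessat(proc_fd, "stat", R_OK, 0) == 0;
}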

Sadly, this is about 5 times slower than using kill().

Polling in a thread

One of the criteria I set was that I’d rather not have to spawn threads. I’d like to run this in multiple processes on an embedded system. Spawning one or more threads for each process that only serve the message queue is not ideal.

Anyway, this would certainly solve the speed issue. With the /proc/pid trick and periodic polling this would definitely perform OK.

The writer process would not get notified about the death of the reader quite as fast, but this would probably be fast enough.
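
For illustration, a sketch of what such a polling helper might look like, with a hypothetical procmon struct for bookkeeping (none of these names come from the actual project):

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical bookkeeping for the rejected polling approach. */
struct procmon {
	int proc_fd;               /* /proc directory fd from the reader */
	atomic_bool reader_alive;
	atomic_bool stop;
};

/* Thread body: re-check the /proc fd at a fixed interval. */
static void* procmon_poll(void* arg)
{
	struct procmon* mon = arg;

	while (!atomic_load(&mon->stop)) {
		if (faccessat(mon->proc_fd, "stat", R_OK, 0) != 0) {
			atomic_store(&mon->reader_alive, false);
			break;
		}

		usleep(100000); /* poll every 100 ms */
	}

	return NULL;
}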

Connected Sockets

One approach that I tried was to maintain a SOCK_SEQPACKET Unix socket connection between the reader and the writer. With the help of poll(), select() and friends, you can be notified when the other end disconnects. The downside is that you have to call poll() for every read or write. The alternative is to expose the event handler in the API; this complicates the API, but it does offer some performance improvement because you can do multiple reads and writes without the extra system call.
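
A minimal sketch of such a disconnect check, assuming sock is the connected SOCK_SEQPACKET descriptor (the function name is mine):

#include <poll.h>

/* Returns 1 while the peer is still connected, 0 after it hangs up. */
static int is_peer_connected(int sock)
{
	struct pollfd pfd = { .fd = sock, .events = 0 };

	/* A zero timeout makes this a pure status check. */
	if (poll(&pfd, 1, 0) < 0)
		return -1;

	/* POLLHUP and POLLERR are reported even with no events requested. */
	return !(pfd.revents & (POLLHUP | POLLERR));
}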

Anyway, with a simple interface, it does not perform much better than the kill() solution. Running epoll_wait() on an epoll fd that has no fds attached is only slightly faster than calling kill().

Pthread Mutex

It’s possible to use mutexes over shared memory. In order to do that, the mutex must have the process-shared attribute set. However, this is not very useful on its own unless you can guarantee that no process is going to crash or be killed while holding a shared mutex lock. Such a guarantee does not exist, so we must also set the “robust” attribute.
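
Setting up such a mutex in shared memory might look roughly like this (a sketch, with error handling abbreviated):

#include <pthread.h>

/* Sketch: initialise a process-shared, robust mutex that lives in a
 * shared memory mapping. */
static int init_shared_mutex(pthread_mutex_t* mutex)
{
	pthread_mutexattr_t attr;

	if (pthread_mutexattr_init(&attr) != 0)
		return -1;

	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

	int rc = pthread_mutex_init(mutex, &attr);
	pthread_mutexattr_destroy(&attr);

	return rc == 0 ? 0 : -1;
}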

The robust attribute makes it so that when the mutex is locked, the operating system is made aware of the fact that the lock has been acquired and which thread holds it. When a thread is killed while holding a robust mutex lock, the mutex is marked as inconsistent by the operating system. The next thread that tries to lock the mutex gets an error and must try to make the data consistent again and remove the mark. On Linux this is done using the robust futex list, but more on that later. The important bit of information here is that checking a mutex’s consistency does not require a system call.

How can this be (ab)used to monitor a single process’s aliveness? Well, one way I can think of is to have the monitored process acquire the lock and hold it until it exits. The monitor can then check if it is alive using pthread_mutex_trylock(). If the call fails with EBUSY, the process is alive and well. If it fails with EOWNERDEAD, or if the locking succeeds, the owner must have been killed or have exited, respectively.
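
A sketch of that check, assuming the mutex was initialised as above (note that pthread_mutex_trylock() returns the error code directly):

#include <errno.h>
#include <pthread.h>

/* Returns 1 while the monitored process holds the lock, 0 once it has
 * exited or died. */
static int is_alive_trylock(pthread_mutex_t* mutex)
{
	switch (pthread_mutex_trylock(mutex)) {
	case EBUSY:
		return 1; /* still locked: the owner is alive */
	case EOWNERDEAD:
		/* The owner was killed; clear the inconsistent state. */
		pthread_mutex_consistent(mutex);
		/* fall through */
	case 0:
		pthread_mutex_unlock(mutex);
		return 0; /* the owner exited or died */
	default:
		return -1;
	}
}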

This is a lot faster than all of the above, but it is still a lot slower than loading data from a memory address, which is essentially free. It turns out that on x86 (at least on my Thinkpad X220) a single lock cmpxchg instruction is rather slow and it is what takes up most of the time spent in pthread_mutex_trylock() according to perf record. You can try running the benchmark with perf record and annotating __pthread_mutex_trylock to see what I’m talking about.

$ perf record ./benchmark mutex
$ perf report

Robust Futex List

On Linux, robust mutexes are implemented using the robust futex list. A futex (“fast userspace mutex”) is backed by a system call that is used to suspend and wake up processes when locking and unlocking contended mutexes, respectively. The robust futex list is a linked list of the futexes that the thread is currently holding. When the process exits (for any reason), the operating system walks the list, marks each futex’s lock word to signify that the owner died, and wakes up any processes that are waiting on the futex. The lock word is a part of the mutex that’s stored in shared memory, so any process that can see the mutex can check the FUTEX_OWNER_DIED bit (bit 30) to see if the previous owner is still alive.

The robust futex list is not exposed in standard libraries such as glibc or musl, so the only standard and portable way to check it is to use pthread_mutex_trylock() as described in the previous solution. However, as we’ve seen, this is rather slow. The system calls get_robust_list and set_robust_list don’t even have glibc wrappers. It is clear that the regular application programmer was never supposed to touch this. Still, I’m tempted.

Armed with the knowledge of robust futex lists, how can we hack the mutex so that we can check the FUTEX_OWNER_DIED bit without pthread_mutex_trylock()?

Adding To The List

The first thing I thought of was to modify the list on my own using get_robust_list() and set_robust_list(). It turns out that this is not a very good option because the standard library is free to implement the linked list as it sees fit, with the only constraint being that there must be a “next” link at the start of every node. Both glibc and musl use doubly linked lists. Unless I can replicate the structure exactly, adding my own node will just throw off the standard library. It is not a good idea to copy the structure, as it may change in the future and may very well differ between systems.

Spawning a Dedicated Thread

Each thread has its own robust list. If I spawn a thread whose only purpose in life is to supply a pristine robust list, the issue with the first solution is averted. I tried this using the clone() system call with a minimal stack size and it worked as expected. See: procmon.c. I don’t like having a thread around that does nothing, but this solution would have been good enough had my subconscious not come up with another one while I was thinking about something else.

Extracting The Mutex’s Lock Word

The list node is located somewhere within the mutex. The lock word can be extracted with the following procedure:

  • Lock the mutex.
  • get_robust_list().
  • Find the node in the list that is contained within the mutex.
  • The lock word is located at the node address plus the futex offset.

The futex offset is defined in the robust list head and it is the offset of the lock word from the start of the list node.

This is how it’s done:

#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Raw syscall wrapper; glibc provides none. tid = 0 means the current thread. */
static long get_robust_list(int tid, struct robust_list_head** head_ptr,
			    size_t* len_ptr)
{
	return syscall(SYS_get_robust_list, tid, head_ptr, len_ptr);
}

/* Returns the offset of the lock word from the start of the mutex, or -1.
 * The mutex must be locked by the calling thread so that it is on the
 * thread's robust list. */
static long find_lock_word_offset(pthread_mutex_t* mutex)
{
	struct robust_list_head* head;
	size_t len;

	int rc = get_robust_list(0, &head, &len);
	if (rc != 0)
		return -1;

	assert(len == sizeof(struct robust_list_head));

	uintptr_t node_addr, mutex_addr = (uintptr_t)mutex;
	struct robust_list* node;

	/* Find the list node that is located within the mutex object. */
	for (node = head->list.next; node != &head->list; node = node->next) {
		node_addr = (uintptr_t)node;

		if (mutex_addr <= node_addr
		 && node_addr < mutex_addr + sizeof(*mutex))
			break;
	}

	if (node == &head->list)
		return -1;

	/* The lock word sits at a fixed offset from the node. */
	return (long)(node_addr - mutex_addr) + head->futex_offset;
}
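
For reference, a hypothetical shared-memory layout and reader-side setup that would go with the check below (the struct and field names are assumptions, not the queue’s real layout):

/* Hypothetical shared-memory layout matching is_reader_alive() below. */
struct my_shm {
	pthread_mutex_t mutex;     /* robust, process-shared */
	long lockword_offset;      /* published by the reader */
	/* ... queue data ... */
};

/* Reader side: hold the lock for the lifetime of the process and publish
 * the lock word offset so that writers can find it. */
static int reader_publish_aliveness(struct my_shm* shm)
{
	if (pthread_mutex_lock(&shm->mutex) != 0)
		return -1;

	long offset = find_lock_word_offset(&shm->mutex);
	if (offset < 0)
		return -1;

	shm->lockword_offset = offset;
	return 0;
}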

Checking for aliveness on the other end goes something like this:

int is_reader_alive(struct my_shm* shm)
{
	uint32_t* addr = (uint32_t*)((uintptr_t)&shm->mutex
				     + shm->lockword_offset);

	/* Relaxed load of the futex lock word. */
	uint32_t lock_word = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/* Non-zero means locked; bit 30 (FUTEX_OWNER_DIED) means the owner
	 * was killed while holding the lock. */
	return lock_word && !(lock_word & (1 << 30));
}

A relaxed __atomic_load() yields a single regular instruction, so there is no faster way to do it and the aliveness check becomes “free”, essentially. Also, the implementation is quite short and I would say it’s even rather simple. But, is this kind of a hack something that should be used in production code?

Benchmarks

The following benchmarks are from my Thinkpad X220. The time column shows the time that it takes to run 10000000 loops. See benchmarks.c.

  Benchmark   Normalised   Time     Explanation
  dummy       1.0          0.04 s   Do nothing
  load        1.0          0.04 s   Loading from a memory address
  mutex       3.5          0.14 s   pthread_mutex_trylock()
  epoll       32.0         1.28 s   Running epoll on an empty fd
  kill        44.5         1.78 s   kill(pid, 0)
  procfs      204.8        8.19 s   faccessat() on /proc/pid