Making a Wayland Screen Capturing Protocol

History

Now that the ext-image-copy-capture-v1 (previously known as ext-screencopy-v1) protocol has been merged, it is probably a good time to explain what it’s all about; but first a bit of history.

In October 2021, I created a draft for a new Wayland protocol extension called ext-screencopy-unstable-v1. It was very similar to wlr-screencopy-unstable-v1 (hereafter referred to as wlr-screencopy), but it added two features that were missing from wlr-screencopy: full damage tracking and cursor capturing.

I would have been happy to extend wlr-screencopy or create a new protocol in the wlr namespace, but wlr-protocols had recently been closed for further development. Thus, the only route available was to submit the protocol to the wayland-protocols standards body. At least, that was my understanding at the time.

Creating a protocol for the greater Wayland ecosystem wasn’t the goal to start with. I just wanted to improve capturing performance and reduce perceived input latency when using a mouse over VNC. In any case, that’s now what we have. However, it is in the “ext” namespace, so it is an optional extension. It would be nice if it were adopted by the likes of Mutter or KWin, but I wouldn’t hold my breath on that one.

The protocol went through many iterations over the span of roughly two years, and along the way we introduced a new auxiliary protocol named ext-image-source-v1 (hereafter referred to as image-source). This was a very clever innovation suggested by Simon Ser. An image-source is an opaque object from which an image or a series of images can be captured. New protocol extensions can be made to ingest or produce image-sources. The image-source protocol brings the following to the table:

New screen capturing protocols don’t have to include different methods for capturing all sorts of capturable entities.
Existing screen capturing protocols need not be extended to add methods for capturing new capturable entities.
When implementing a protocol, capturable objects need not depend on ext-screencopy or any other capturing protocol; they just need to implement an image-source interface.
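
The decoupling described in the three points above can be sketched in a few lines of Python. This is only an illustration of the idea, not the protocol’s API: ImageSource, OutputSource, ToplevelSource and capture_once are hypothetical names, and “capturing” here just returns a string.

```python
from typing import Protocol


class ImageSource(Protocol):
    """Hypothetical stand-in for an opaque image-source object."""

    def describe(self) -> str: ...


class OutputSource:
    """Hypothetical source backed by an output (a monitor)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def describe(self) -> str:
        return f"output:{self.name}"


class ToplevelSource:
    """Hypothetical source backed by a top-level (a window)."""

    def __init__(self, title: str) -> None:
        self.title = title

    def describe(self) -> str:
        return f"toplevel:{self.title}"


def capture_once(source: ImageSource) -> str:
    """The capturing side only sees an ImageSource, never what is behind it."""
    return f"captured frame from {source.describe()}"


print(capture_once(OutputSource("DP-1")))
print(capture_once(ToplevelSource("terminal")))
```

Adding a new kind of capturable object in this sketch means adding one more class with a describe method; capture_once stays untouched, which is the property the list above is getting at.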

For a while, I was unable to inspire external interest in the project, so it stalled a bit. Frankly, I had given up on it. Making changes to the protocol wasn’t really very hard. The two most difficult bits were getting reviews and updating the wlroots implementation after each architectural change.

Finally, early this year, Simon Ser took some interest in the project. He made some improvements to the protocol and provided a compositor-side implementation of his own. After a discussion that dragged on for a bit, we renamed the protocol: it is now called ext-image-copy-capture-v1, and the image-source is called ext-image-capture-source-v1. With Simon’s help, the protocol was finally good enough and it has now been merged.

Differences

Anyhow, what is the point of this thing? To answer that question, let’s first take a look at the protocol that this is meant to replace: wlr-screencopy.

The wlr-screencopy protocol offers any privileged Wayland client a way to capture an output (Wayland term for a computer monitor). The client sends a pixel buffer to the compositor and asks it to fill in the buffer with the contents of the screen. Since version 3 of wlr-screencopy, the compositor can tell the capturing client which regions of the image changed since its last capture. This information can be passed on to video compressors to reduce processing time. Another feature of version 3 is that the client can ask the compositor to wait for a change before it completes the capture and returns the buffer. Before that, frames would always be captured immediately upon request from the client.
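
To illustrate how reported damage can cut down processing time, here is a minimal Python sketch that maps damage rectangles onto fixed-size tiles, so that an encoder would only need to look at the tiles that changed. The 64-pixel tile size, the (x, y, width, height) rectangle layout and the damaged_tiles helper are assumptions made for this example; they are not part of either protocol.

```python
from typing import Iterable, Set, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def damaged_tiles(rects: Iterable[Rect], tile_size: int = 64) -> Set[Tuple[int, int]]:
    """Return the (column, row) indices of every tile touched by any rectangle."""
    tiles: Set[Tuple[int, int]] = set()
    for x, y, w, h in rects:
        if w <= 0 or h <= 0:
            continue  # an empty rectangle damages nothing
        first_col, last_col = x // tile_size, (x + w - 1) // tile_size
        first_row, last_row = y // tile_size, (y + h - 1) // tile_size
        for row in range(first_row, last_row + 1):
            for col in range(first_col, last_col + 1):
                tiles.add((col, row))
    return tiles


# A 100x20 rectangle at (10, 10) straddles the first two tiles of the top row.
print(sorted(damaged_tiles([(10, 10, 100, 20)])))
```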

The ext-image-copy-capture protocol adds the ability for the client to tell the compositor which regions of the client’s buffer need to be updated, so the compositor doesn’t need to fill in the regions that are already up to date. If the client has only one buffer, that buffer will always be up to date after each capture, but if the client uses double-buffered capture, the older of the two buffers will be missing the updates that went into the newer one. The client can now tell the compositor which parts are missing from the older buffer when submitting it for the next frame capture. The compositor then knows which parts of the image need not be copied, so instead of copying the whole frame it only copies the region that changed plus the region that’s missing from the client’s buffer.
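
The bookkeeping for double-buffered capture can be sketched as a small simulation in Python. Everything here is a simplified model and not the protocol’s API: regions are represented as sets of tile coordinates, and CaptureBuffer and DoubleBufferedCapture are hypothetical names. The point is only to show that the region to copy is the union of the frame damage and the damage accumulated by the buffer being submitted.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple

Tile = Tuple[int, int]


@dataclass
class CaptureBuffer:
    name: str
    # Tiles whose contents in this buffer are out of date.
    stale: Set[Tile] = field(default_factory=set)


class DoubleBufferedCapture:
    def __init__(self, all_tiles: Iterable[Tile]) -> None:
        tiles = set(all_tiles)
        # Fresh buffers contain nothing useful, so every tile starts out stale.
        self.buffers = [CaptureBuffer("A", set(tiles)), CaptureBuffer("B", set(tiles))]
        self.next_index = 0

    def capture(self, frame_damage: Iterable[Tile]):
        """Simulate one capture; return (buffer name, tiles that had to be copied)."""
        damage = set(frame_damage)
        buf = self.buffers[self.next_index]
        other = self.buffers[1 - self.next_index]
        # Copy what changed on screen plus whatever this buffer missed earlier.
        to_copy = damage | buf.stale
        buf.stale = set()      # this buffer is now fully up to date
        other.stale |= damage  # the other buffer has just missed this update
        self.next_index = 1 - self.next_index
        return buf.name, to_copy


capture = DoubleBufferedCapture([(0, 0), (0, 1), (1, 0), (1, 1)])
print(capture.capture([]))        # A: everything, the buffer was empty
print(capture.capture([(0, 0)]))  # B: everything, the buffer was empty
print(capture.capture([(1, 1)]))  # A: the new damage plus the tile A missed
print(capture.capture([]))        # B: only the tile B missed
```

Once both buffers have been filled, each capture in this model copies only a handful of tiles rather than the whole frame.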

The ext-image-copy-capture protocol also adds cursor capturing. You can capture the cursor image, the hotspot (how you need to shift the image to put the pointy bit in the right place) and the position of the cursor. This allows VNC servers and other remote desktop software to render the cursor on the remote client’s side. This is useful because it allows people to move the cursor around freely without experiencing the latency of the network. It gives users the illusion of a more responsive system.
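
As a small illustration of how the three pieces fit together on the remote side, the sketch below computes where the top-left corner of the cursor image should be drawn. The cursor_image_origin helper and the (x, y) tuple layout are assumptions for this example, not names from the protocol.

```python
from typing import Tuple

Point = Tuple[int, int]


def cursor_image_origin(position: Point, hotspot: Point) -> Point:
    """Return where to draw the cursor image's top-left corner.

    position is where the pointer is on screen; hotspot is the offset of the
    "pointy bit" inside the cursor image. Shifting the image back by the
    hotspot puts the pointy bit exactly on the pointer position.
    """
    pos_x, pos_y = position
    hot_x, hot_y = hotspot
    return (pos_x - hot_x, pos_y - hot_y)


# A pointer at (100, 200) with the hotspot 4 px right and 2 px down inside the image.
print(cursor_image_origin((100, 200), (4, 2)))  # (96, 198)
```

A remote client that has the cursor image and hotspot can redraw the cursor from its own local pointer position in this way, without waiting for a new frame from the server.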

Lastly, the protocol’s architecture also allows for capturing top-levels (Wayland term for windows), and pretty much anything else from which an image can be derived.

Credits

Many people had a few things to say about the protocol after I submitted it. There were 39 participants in the discussion on the merge request. I would like to give thanks to everyone who helped to improve upon the design. Special thanks go to Simon Ser for initiating the final push and for implementing the latest iteration in wlroots, Simon Zeni for reviewing the final iteration of the protocol and giving us the official stamp of approval, Victoria Brekenfeld for implementing the protocol in Cosmic and providing us with some valuable feedback, and Lynne for reviewing the first iteration of the protocol.