Yesterday I finished some changes to my VNC server that will offload some of the damage checking to the GPU. This doesn’t really improve performance much, but it does put about 10% less load on a single CPU core on my system. Generally, my CPU isn’t very loaded, except when I’m compiling code. I don’t play computer games, so I imagine that my GPU spends most of its time sleeping. It is probably also true of people who run VNC that they don’t play video games that much, at least not while they’re running VNC. Video gaming over VNC is probably never going to take off. Why not put the GPU to some good use?

What is damage checking, you may ask? Damage checking is comparing a new frame (image) against the last frame to see which pixels have changed between them, i.e. which parts of the framebuffer are “damaged” and must be redrawn. This is done to reduce network traffic, but it is also useful in that it reduces the amount of memory that needs to be accessed by the encoding algorithm. Here’s some pseudocode to express this in its simplest form:

for (y = 0; y < height; y++)
	for (x = 0; x < width; x++)
		if (old[x, y] != new[x, y])
			mark_damaged(x, y)

As you can probably imagine, doing this on the CPU is not going to be cheap. For the average modern monitor, the memory that this is going to touch will be at least 1920 ⋅ 1080 ⋅ 4¹ ⋅ 2 ≈ 16MB! This number does not include the buffer to store the damage into. If it is stored in a bitmap, it will take 1920 ⋅ 1080 / 8 ≈ 250kB, which is not so bad, but it probably precludes any SIMD magic taking place in this algorithm. Doing something like damage[x, y] = old[x, y] != new[x, y] might be auto-vectorized into something decent if damage is implemented as a byte array, but not as a bitmap, as that would require state to be carried over from previous iterations.
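
For illustration, a byte-map variant of the loop above might look something like the sketch below; the tightly packed 32-bit pixel layout and the function name are assumptions made for the example, not wayvnc’s actual code:

#include <stddef.h>
#include <stdint.h>

/* Per-pixel damage check into a byte map. One byte per pixel and no state
 * carried between iterations gives the compiler a fair chance of
 * auto-vectorizing the inner loop. Pixels are assumed to be 32-bit RGBA,
 * tightly packed. */
static void check_damage_bytemap(const uint32_t *old_frame,
		const uint32_t *new_frame,
		uint8_t *damage, /* width * height bytes */
		size_t width, size_t height)
{
	for (size_t y = 0; y < height; y++)
		for (size_t x = 0; x < width; x++) {
			size_t i = y * width + x;
			damage[i] = old_frame[i] != new_frame[i];
		}
}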

When the damage has been found, we must find some contiguous regions within it, so that those regions can be encoded and sent to the client. For this I use pixman, which is a pixel manipulation library. The function that I use to mark damaged regions (pixman_region_union_rect) does a non-trivial amount of work, including memory allocations. Implementing mark_damaged() in the example above using pixman_region_union_rect(&damage, &damage, x, y, 1, 1) is not a good idea.

Knowing the damage on a per-pixel basis isn’t actually very useful, nor is it practical, as we’ve seen. It is even detrimental to encoding efficiency to have too fine-grained regions, since encoding each distinct region carries some overhead. One approach, which is what I use, is to split the image into tiles. Each tile is a 32×32 pixel region. There is nothing scientific about this number really; it’s just something that works well enough. Now a simple algorithm may look like this:

for (y = 0; y < height; y += 32)
	for (x = 0; x < width; x += 32)
		if (frame1[x:x+32, y:y+32] != frame2[x:x+32, y:y+32])
			mark_damaged(x, y, 32, 32)

Something like this is what wayvnc has been doing via NeatVNC for a while now. See damage.c.
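
For reference, a CPU-side implementation along those lines could look roughly like the following; this is only a sketch under the same assumptions as before (tightly packed 32-bit pixels), not the actual damage.c:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <pixman.h>

#define TILE_SIZE 32

/* Compare one tile, clamped at the right/bottom edges, one row at a time. */
static bool is_tile_damaged(const uint32_t *old_frame, const uint32_t *new_frame,
		int width, int height, int tx, int ty)
{
	int w = TILE_SIZE < width - tx ? TILE_SIZE : width - tx;
	int h = TILE_SIZE < height - ty ? TILE_SIZE : height - ty;

	for (int y = ty; y < ty + h; y++) {
		size_t off = (size_t)y * width + tx;
		if (memcmp(old_frame + off, new_frame + off, w * sizeof(uint32_t)))
			return true;
	}

	return false;
}

/* Mark one rectangle per damaged tile. */
static void check_damage_tiled(const uint32_t *old_frame, const uint32_t *new_frame,
		int width, int height, pixman_region16_t *damage)
{
	for (int y = 0; y < height; y += TILE_SIZE)
		for (int x = 0; x < width; x += TILE_SIZE)
			if (is_tile_damaged(old_frame, new_frame, width, height, x, y))
				pixman_region_union_rect(damage, damage, x, y,
						TILE_SIZE, TILE_SIZE);
}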

It is probable that the naive algorithm above would actually perform better than the tiled one, given a more suitable mark_damaged() function combined with one that turns the resulting bitmap or byte map into a pixman region. I have not explored this, as there are other ventures far more likely to yield better results. Both approaches are pretty poor when it comes to conserving memory bandwidth. We still have to check all the pixels. That doesn’t change.
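
For what it’s worth, such a conversion could merge horizontal runs of damaged pixels, so that pixman_region_union_rect() is called once per run rather than once per pixel. A minimal sketch, using the byte map from earlier (an illustration, not something wayvnc actually does):

#include <stdint.h>
#include <pixman.h>

/* Turn a per-pixel byte map into a pixman region by merging each horizontal
 * run of damaged pixels into a single one-pixel-high rectangle. */
static void bytemap_to_region(const uint8_t *damage_map, int width, int height,
		pixman_region16_t *damage)
{
	for (int y = 0; y < height; y++) {
		int x = 0;
		while (x < width) {
			/* Skip clean pixels. */
			while (x < width && !damage_map[y * width + x])
				x++;

			int start = x;

			/* Extend over the damaged run. */
			while (x < width && damage_map[y * width + x])
				x++;

			if (x > start)
				pixman_region_union_rect(damage, damage,
						start, y, x - start, 1);
		}
	}
}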

Wayvnc can do frame capturing via Linux DMA-BUFs. In short, these are resources, represented by file descriptors, that can be passed between processes. A GPU memory region can be represented by such an entity, so frames can be passed from the Wayland compositor to the VNC server. Copying things from the GPU is pretty expensive too, as it requires the CPU to reach into the GPU’s memory and grab 8MB of data. This is what happens in the compositor when you use the “wlr-screencopy” protocol: it writes the resulting frame into shared memory, but it slows the compositor down while doing so. It is better to have the client do the copying, because this leaves the compositor free to do other things.
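
To illustrate why this is attractive: a single-plane DMA-BUF can usually be turned into a GL texture without copying any pixels at all, roughly as in the sketch below. It assumes the EGL_EXT_image_dma_buf_import and GL_OES_EGL_image extensions are available; egl_display, dmabuf_fd, width, height and stride are placeholders, and real code would look up the extension entry points with eglGetProcAddress() and check for errors:

#define EGL_EGLEXT_PROTOTYPES
#define GL_GLEXT_PROTOTYPES
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <drm_fourcc.h>

/* Wrap a single-plane DMA-BUF in an EGLImage and bind it to a texture. The
 * pixels stay on the GPU; nothing is copied through the CPU. */
static GLuint import_dmabuf_texture(EGLDisplay egl_display, int dmabuf_fd,
		int width, int height, int stride)
{
	const EGLint attribs[] = {
		EGL_WIDTH, width,
		EGL_HEIGHT, height,
		EGL_LINUX_DRM_FOURCC_EXT, DRM_FORMAT_ARGB8888,
		EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
		EGL_NONE,
	};

	EGLImageKHR image = eglCreateImageKHR(egl_display, EGL_NO_CONTEXT,
			EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

	GLuint tex;
	glGenTextures(1, &tex);
	glBindTexture(GL_TEXTURE_2D, tex);
	glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, (GLeglImageOES)image);

	return tex;
}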

Because the frames are already on the GPU when they arrive, it definitely pays off to do some pre-processing on the GPU before passing the data on to the CPU. In fact, the simple, naive damage checking algorithm can be implemented in the GLSL shading language like this:

precision mediump float;

uniform sampler2D u_tex0;
uniform sampler2D u_tex1;

varying vec2 v_tex_coord;

void main()
{
	float r = float(texture2D(u_tex0, v_tex_coord).rgb != texture2D(u_tex1, v_tex_coord).rgb);
	gl_FragColor = vec4(r);
}

And if it is rendered into a single-channel framebuffer object (i.e. only the red component), the memory that needs to be copied to get it to the CPU will be one quarter of the memory required to copy a whole buffer.

Now, the actual image also needs to be copied whole, or does it? In OpenGL ES 2.0, pixel data can be copied using the glReadPixels() function. It is limited in that, in practice, only the height and the vertical offset can be varied when selecting a region within the frame to copy; the width has to stay the same as the width of the source buffer, since OpenGL ES 2.0 provides no way to specify a destination row stride. With this in mind, it is still possible to make some crude adjustments. Because the damage has already been rendered and copied, that information can be used to derive which parts of the y-axis have been damaged, so one trick that can save a lot of copying is to copy only a ribbon across the screen that contains all the damage. This is how it is currently done in wayvnc. It helps when the changes are small, but for whole-screen changes or video, it makes no difference.
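
In code, the ribbon trick boils down to something like the sketch below; the helper name is made up, and it assumes the FBO holding the captured frame is currently bound for reading:

#include <GLES2/gl2.h>
#include <stddef.h>
#include <stdint.h>

/* Read back only a full-width "ribbon" that covers every damaged row. Note
 * that glReadPixels() counts rows from the bottom of the framebuffer, so
 * depending on how the frame was rendered, the damage map may need to be
 * flipped to match. */
static void read_damaged_ribbon(const uint8_t *damage_map, int width, int height,
		uint8_t *dst /* full frame, width * height * 4 bytes */)
{
	int y_min = height, y_max = -1;

	for (int y = 0; y < height; y++)
		for (int x = 0; x < width; x++)
			if (damage_map[y * width + x]) {
				if (y < y_min) y_min = y;
				if (y > y_max) y_max = y;
			}

	if (y_max < y_min)
		return; /* Nothing changed. */

	glReadPixels(0, y_min, width, y_max - y_min + 1, GL_RGBA, GL_UNSIGNED_BYTE,
			dst + (size_t)y_min * width * 4);
}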

Even though the damage frame is now only 2MB, cycling through it still shows up close to the top when running perf record & perf report. It’s nowhere near as inefficient as before, but it’s still up there. What more can be done? Well, why don’t we let the GPU handle the tiling for us? A simplified version of the shader might look like this:

precision mediump float;

uniform sampler2D u_tex0;
uniform sampler2D u_tex1;

uniform vec2 u_tex_size;

varying vec2 v_tex_coord;

bool is_pixel_damaged(vec2 pos)
{
	return texture2D(u_tex0, pos).rgb != texture2D(u_tex1, pos).rgb;
}

/* pos is expected in pixel coordinates here, unlike the texture coordinates
   in the first shader; it is normalised just before sampling. */
bool is_region_damaged(vec2 pos)
{
	bool r = false;

	for (int y = -16; y < 16; ++y)
		for (int x = -16; x < 16; ++x) {
			float px = float(x) + pos.x;
			float py = float(y) + pos.y;

			if (is_pixel_damaged(vec2(px, py) / u_tex_size))
				r = true;
		}

	return r;
}

void main()
{
	float r = float(is_region_damaged(v_tex_coord));
	gl_FragColor = vec4(r);
}

This can be sampled into a much smaller framebuffer object: one that is ⌈1920 ⋅ 1080 / 32 / 32⌉ = 2025 bytes. That’s a size not even worth worrying about. And as expected, neither copying it to the CPU nor going through it shows up in perf. This last change hasn’t made it into wayvnc yet, but it will be there as soon as I clean up the changes.
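
Once that tiny tile map is on the CPU, turning it into a pixman region is trivial. Here is a rough sketch of what such a conversion could look like (an illustration, not the actual wayvnc code):

#include <stdint.h>
#include <pixman.h>

#define TILE_SIZE 32

/* Expand the small tile map (one byte per 32x32 tile) into a pixman region,
 * clamping the last row and column of tiles to the frame size. */
static void tilemap_to_region(const uint8_t *tile_map, int width, int height,
		pixman_region16_t *damage)
{
	int tiles_x = (width + TILE_SIZE - 1) / TILE_SIZE;
	int tiles_y = (height + TILE_SIZE - 1) / TILE_SIZE;

	for (int ty = 0; ty < tiles_y; ty++)
		for (int tx = 0; tx < tiles_x; tx++) {
			if (!tile_map[ty * tiles_x + tx])
				continue;

			int x = tx * TILE_SIZE;
			int y = ty * TILE_SIZE;
			int w = TILE_SIZE < width - x ? TILE_SIZE : width - x;
			int h = TILE_SIZE < height - y ? TILE_SIZE : height - y;

			pixman_region_union_rect(damage, damage, x, y, w, h);
		}
}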

I’ll leave benchmarking as an exercise for the reader. Have fun!

There will be a second blog post soon on this subject where I go into more details regarding the shaders and maybe I’ll do some benchmarking. Stay tuned!

  1. Because RGBA: each color component takes 8 bits, and the alpha channel, which controls opacity, also takes 8 bits. It’s not used here, but it pays to keep it so that memory accesses stay aligned on a 32-bit boundary.