Checking Frame Damage on the GPU
Yesterday I finished some changes to my VNC server that offload some of the damage checking to the GPU. This doesn’t really improve performance much, but it does put about 10% less load on a single CPU core on my system. Generally, my CPU isn’t very loaded, except when I’m compiling code. I don’t play computer games, so I imagine that my GPU spends most of its time sleeping. It is probably also true of people who run VNC that they don’t play video games that much, at least not while they’re running VNC. Video gaming over VNC is probably never going to take off. Why not put the GPU to some good use?
What is damage checking, you may ask? Damage checking is comparing a new frame (image) against the last frame to see which pixels have changed between them, i.e. which parts of the framebuffer are “damaged” and must be redrawn. This is done to reduce network traffic, but it is also useful in that it reduces the amount of memory that needs to be accessed by the encoding algorithm. Here’s some pseudocode to express this in its simplest form:
for (y = 0; y < height; y++)
    for (x = 0; x < width; x++)
        if (old[x, y] != new[x, y])
            mark_damaged(x, y)
As you can probably imagine, doing this on the CPU is not going to be cheap. For the average modern monitor, the memory that this is going to touch will be at least 1920 ⋅ 1080 ⋅ 4¹ ⋅ 2 ≈ 16 MB! This number does not include the buffer to store the damage into. If it is stored in a bitmap, it will take 1920 ⋅ 1080 / 8 ≈ 250 kB, which is not so bad, but probably precludes any SIMD magic taking place in this algorithm. Doing something like damage[x, y] = old[x, y] != new[x, y] might be auto-vectorized into something decent if damage is implemented as a byte array, but not a bitmap, as that would require state to be carried over from previous iterations.
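To make that concrete, here is a minimal sketch of the byte-map variant that a compiler has a fair chance of auto-vectorizing, assuming tightly packed 32-bit pixels (the function name and layout are made up for illustration):

#include <stddef.h>
#include <stdint.h>

/* Per-pixel damage check writing into a byte map: one byte per pixel,
 * 1 if the pixel changed, 0 otherwise. The inner loop is branch-free,
 * so the compiler may be able to auto-vectorize it. */
static void check_damage(uint8_t *damage, const uint32_t *old_frame,
                const uint32_t *new_frame, int width, int height)
{
        for (int y = 0; y < height; ++y)
                for (int x = 0; x < width; ++x) {
                        size_t i = (size_t)y * width + x;
                        damage[i] = old_frame[i] != new_frame[i];
                }
}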
When the damage has been found, we must find some contiguous regions within it, so that those regions can be encoded and sent to the client. For this I use pixman, which is a pixel manipulation library. The function that I use to mark damaged regions, pixman_region_union_rect(), does a non-trivial amount of work, including memory allocations. Implementing mark_damaged() in the example above as pixman_region_union_rect(&damage, &damage, x, y, 1, 1) is not a good idea.
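For reference, here is a minimal sketch of how a pixman region is typically built up from rectangles and then walked to get the resulting boxes. This is just to illustrate the API; the coordinates are made up:

#include <pixman.h>
#include <stdio.h>

int main(void)
{
        struct pixman_region16 damage;
        pixman_region_init(&damage);

        /* Union in a couple of damaged rectangles; pixman merges
         * overlapping ones and may allocate memory as it goes. */
        pixman_region_union_rect(&damage, &damage, 0, 0, 64, 32);
        pixman_region_union_rect(&damage, &damage, 32, 0, 64, 32);

        /* Walk the resulting rectangles, e.g. to encode and send them. */
        int n_rects = 0;
        struct pixman_box16 *rects =
                pixman_region_rectangles(&damage, &n_rects);
        for (int i = 0; i < n_rects; ++i)
                printf("damaged: %d,%d %dx%d\n", rects[i].x1, rects[i].y1,
                                rects[i].x2 - rects[i].x1,
                                rects[i].y2 - rects[i].y1);

        pixman_region_fini(&damage);
        return 0;
}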
Knowing the damage on a per-pixel basis isn’t actually very useful, nor is it practical, as we’ve seen. It is even detrimental to encoding efficiency to have too fine-grained regions, since encoding each distinct region carries some overhead. One approach, which is what I use, is to split the image into tiles. Each tile is a 32×32 pixel region. There is nothing scientific about this number really; it’s just something that works well enough. Now, a simple algorithm may look like this:
for (y = 0; y < height; y += 32)
    for (x = 0; x < width; x += 32)
        if (frame1[x:x+32, y:y+32] != frame2[x:x+32, y:y+32])
            mark_damaged(x, y, 32, 32)
Something like this is what wayvnc has been doing via NeatVNC for a while now. See damage.c.
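For a concrete idea of what the tile comparison can look like on the CPU, here is a rough sketch assuming tightly packed 32-bit pixels with equal strides; the real implementation in damage.c differs in the details:

#include <pixman.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TILE_SIZE 32

/* Compare two frames tile by tile and union damaged tiles into a region.
 * Assumes 4 bytes per pixel and that both buffers have the same layout. */
static void check_damage_tiled(struct pixman_region16 *damage,
                const uint32_t *frame0, const uint32_t *frame1,
                int width, int height)
{
        for (int y = 0; y < height; y += TILE_SIZE) {
                int th = y + TILE_SIZE <= height ? TILE_SIZE : height - y;

                for (int x = 0; x < width; x += TILE_SIZE) {
                        int tw = x + TILE_SIZE <= width ? TILE_SIZE : width - x;

                        /* Compare the tile row by row. */
                        bool damaged = false;
                        for (int row = 0; row < th && !damaged; ++row) {
                                size_t off = (size_t)(y + row) * width + x;
                                damaged = memcmp(frame0 + off, frame1 + off,
                                                tw * sizeof(uint32_t)) != 0;
                        }

                        if (damaged)
                                pixman_region_union_rect(damage, damage,
                                                x, y, tw, th);
                }
        }
}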
It is probable that the naive algorithm above would actually perform better than the tiled one, given a more suitable mark_damaged() function combined with one that turns the resulting bitmap or byte map into a pixman_region. I have not explored this, as there are other ventures far likelier to yield better results. Both approaches are pretty poor when it comes to conserving memory bandwidth. We still have to check all the pixels. That doesn’t change.
Wayvnc can do frame capturing via Linux DMA-BUFs. In short, they are resources that are represented by file descriptors that can be passed between processes. A GPU memory region can be represented by such an entity, so they can be sent from the Wayland compositor to the VNC server. Copying things from the GPU is pretty expensive too, as it requires the CPU to reach into the GPU’s memory and grab 8MB of data. This happens in the compositor when you use the “wlr-screencopy” protocol. It writes the resulting frame into shared memory, but slows down the compositor while it does so. It is better to have the client do the copying because this leaves the compositor free to do other things.
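To give a rough idea of what receiving such a frame looks like on the client side, a single-plane DMA-BUF can be imported as a GL texture through the EGL_EXT_image_dma_buf_import and GL_OES_EGL_image extensions, roughly like the sketch below. Error handling and multi-planar formats are omitted, and the function name is made up; wayvnc’s actual import path is more involved:

#define EGL_EGLEXT_PROTOTYPES
#define GL_GLEXT_PROTOTYPES
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>
#include <stdint.h>

/* Import a single-plane DMA-BUF as a GLES2 texture. The fd, offset,
 * stride (pitch) and DRM fourcc come from the compositor. In a real
 * program the extension entry points are looked up via eglGetProcAddress(). */
static GLuint import_dmabuf(EGLDisplay dpy, int fd, uint32_t offset,
                uint32_t stride, uint32_t fourcc, int width, int height)
{
        const EGLint attribs[] = {
                EGL_WIDTH, width,
                EGL_HEIGHT, height,
                EGL_LINUX_DRM_FOURCC_EXT, (EGLint)fourcc,
                EGL_DMA_BUF_PLANE0_FD_EXT, fd,
                EGL_DMA_BUF_PLANE0_OFFSET_EXT, (EGLint)offset,
                EGL_DMA_BUF_PLANE0_PITCH_EXT, (EGLint)stride,
                EGL_NONE,
        };

        EGLImageKHR image = eglCreateImageKHR(dpy, EGL_NO_CONTEXT,
                        EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);

        return tex;
}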
Because the frames are already on the GPU when they arrive, it definitely pays off to do some pre-processing there before passing the data on to the CPU. In fact, the simple, naive damage checking algorithm can be implemented as a GLSL fragment shader like this:
precision mediump float;
uniform sampler2D u_tex0;
uniform sampler2D u_tex1;
varying vec2 v_tex_coord;

void main()
{
    float r = float(texture2D(u_tex0, v_tex_coord).rgb != texture2D(u_tex1, v_tex_coord).rgb);
    gl_FragColor = vec4(r);
}
And if it is rendered into a single channel framebuffer object (i.e. only the red component), the memory that needs to be copied to get it to the CPU will be one quarter of the memory required to copy a whole buffer.
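Setting up such a render target in GLES2 needs an extension for a single-channel color-renderable format. A minimal sketch, assuming GL_EXT_texture_rg is available (wayvnc may set this up differently), could look like this:

#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* Create a single-channel (red only) render target for the damage map.
 * GL_RED_EXT comes from GL_EXT_texture_rg; check for the extension first. */
static GLuint create_damage_fbo(int width, int height, GLuint *tex_out)
{
        GLuint tex, fbo;

        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RED_EXT, width, height, 0,
                        GL_RED_EXT, GL_UNSIGNED_BYTE, NULL);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                        GL_TEXTURE_2D, tex, 0);

        *tex_out = tex;
        return fbo;
}

Depending on the driver, reading this attachment back with glReadPixels() may require the implementation-defined read format (GL_IMPLEMENTATION_COLOR_READ_FORMAT), so that is worth checking at runtime.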
Now, the actual image also needs to be copied whole, or does it? In OpenGL ES 2.0, pixel data can be copied using the glReadPixels() function. It is limited in that only the height and the location on the vertical axis may be varied when selecting a region within the frame to copy, but the width must always be the same as the width of the source buffer. With this in mind, it is actually possible to make some crude adjustments. Because the damage has already been rendered and copied, that information can be used to derive which parts of the y-axis have been damaged, so one trick that can save a lot of copying is to just copy a ribbon region across the screen that contains all the damage. This is how it is currently done in wayvnc. This helps when there are small changes, but for whole-screen changes or video, it makes no difference.
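A sketch of that ribbon trick, assuming the per-pixel damage map has already been read back into a byte buffer (one byte per pixel, non-zero meaning damaged), might look roughly like this; the function name and buffer layout are illustrative only:

#include <GLES2/gl2.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Find the vertical extent of the damage and read back only that ribbon.
 * "damage_map" is the single-channel damage buffer already copied from
 * the GPU; "dst" must be able to hold a full frame of 32-bit pixels. */
static void read_damaged_ribbon(const uint8_t *damage_map, uint32_t *dst,
                int width, int height)
{
        int y_min = height, y_max = 0;

        for (int y = 0; y < height; ++y) {
                bool row_damaged = false;
                for (int x = 0; x < width; ++x)
                        if (damage_map[(size_t)y * width + x]) {
                                row_damaged = true;
                                break;
                        }
                if (!row_damaged)
                        continue;
                if (y < y_min) y_min = y;
                if (y > y_max) y_max = y;
        }

        if (y_min > y_max)
                return; /* No damage at all. */

        /* Copy a full-width ribbon covering all damaged rows. */
        glReadPixels(0, y_min, width, y_max - y_min + 1, GL_RGBA,
                        GL_UNSIGNED_BYTE, dst + (size_t)y_min * width);
}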
Even though the damage frame is now only 2 MB, cycling through it still shows up close to the top when running perf record & perf report. It’s nowhere near as inefficient as before, but it’s still up there. What more can be done? Well, why don’t we let the GPU handle the tiling for us? A simplified version of the shader might look like this:
precision mediump float;
uniform sampler2D u_tex0;
uniform sampler2D u_tex1;
uniform vec2 u_tex_size;
varying vec2 v_tex_coord;

bool is_pixel_damaged(vec2 pos)
{
    return texture2D(u_tex0, pos).rgb != texture2D(u_tex1, pos).rgb;
}

/* pos is in pixel units; it is normalised by dividing by u_tex_size. */
bool is_region_damaged(vec2 pos)
{
    bool r = false;
    for (int y = -16; y < 16; ++y)
        for (int x = -16; x < 16; ++x) {
            float px = float(x) + pos.x;
            float py = float(y) + pos.y;
            if (is_pixel_damaged(vec2(px, py) / u_tex_size))
                r = true;
        }
    return r;
}

void main()
{
    float r = float(is_region_damaged(v_tex_coord));
    gl_FragColor = vec4(r);
}
This can be sampled into a much smaller framebuffer object: one that is only 1920 ⋅ 1080 / 32 / 32 ≈ 2 kB. That’s a size not even worth worrying about. And as expected, neither copying it to the CPU nor going through it shows up in perf. This last change hasn’t made it into wayvnc yet, but it will be there as soon as I clean up the changes.
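Once the tile map has been read back, turning it into a pixman region is cheap. A minimal sketch, assuming one byte per 32×32 tile with non-zero meaning damaged (again, the function name is made up):

#include <pixman.h>
#include <stdint.h>

#define TILE_SIZE 32

/* Convert the tile-level damage map copied from the GPU into a pixman
 * region in pixel coordinates. Non-zero bytes mark damaged tiles. */
static void tile_map_to_region(struct pixman_region16 *damage,
                const uint8_t *tile_map, int tiles_x, int tiles_y)
{
        for (int ty = 0; ty < tiles_y; ++ty)
                for (int tx = 0; tx < tiles_x; ++tx)
                        if (tile_map[ty * tiles_x + tx])
                                pixman_region_union_rect(damage, damage,
                                                tx * TILE_SIZE,
                                                ty * TILE_SIZE,
                                                TILE_SIZE, TILE_SIZE);
}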
I’ll leave benchmarking as an exercise for the reader. Have fun!
There will be a second blog post soon on this subject where I go into more details regarding the shaders and maybe I’ll do some benchmarking. Stay tuned!
1. Because RGBA. Each color component takes 8 bits, and the alpha channel, which controls opacity, also takes 8 bits. It’s not used here, but it pays to have it to keep memory access aligned on a 32-bit boundary.