Sunday, September 14, 2008

GSoC is over, how did Generic GPU-Accelerated Video Decoding do?

So GSoC has come to a close, and this project was successful, in that there is a working XvMC implementation sitting in nouveau/mesa's gallium-0.1 branch. Currently the NV40 Gallium driver is the only one complete enough to run XvMC, and there are still a few missing features (interlaced video isn't supported, subpictures aren't implemented yet, and only motion compensation is currently accelerated).

In my last entry I mentioned that I was hoping to spend the last part of GSoC getting IDCT working, but I came to realize that this would probably require more work than I initially estimated, due to the limited render target formats GPUs support. We decided that we may also want to take advantage of fixed-function IDCT hardware if it is available, and one of the other Nouveau contributors had been looking into this on NV40, so I'm hoping we can build on his efforts and get that into the Gallium NV40 driver in some fashion. Instead I spent the last two weeks of GSoC, and the first two weeks of the rest of my life, focusing on performance and cleaning up a few bugs here and there.

As far as performance goes, we managed to grab most of the low-hanging fruit:

  • We buffer an entire frame of content and fire that off with a few draw calls. Most frames, depending on their content, can be done in two draw calls (see the batching sketch after this list).

  • Because we have to fill buffers with new content each frame, we don't necessarily want to wait until the GPU is done with those buffers before we map and update them. Since we don't need their old contents, we can just allocate a set of buffers and rotate through them, double-buffer style (see the buffer-rotation sketch after this list).

  • For P and B frames many blocks are composed entirely of pixels from the reference frame(s), so we don't technically need to upload any new data.

    Previously we would clear that block of the source texture to black, so that it didn't contribute anything to the destination block. However, for most P and B frames a significant number of blocks fall into this category, and most frames are P or B frames, so that's a lot of useless clearing on the CPU side and texel fetching on the GPU side.

    To get around this we clear only the first such zero block of each frame for the luma and two chroma channels, and for subsequent zero blocks we texture from that first block (see the zero-block sketch after this list). This saves a nice chunk of CPU time, but doesn't do much for GPU texture bandwidth.

    Once I figure out how TGSI expresses flow control constructs I'm hoping we can just set the texcoords for zero blocks to the negative range and conditionally tex fetch, but for older hardware which doesn't support conditional execution the current path should be good.
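
To make the batching from the first item concrete, here's a rough sketch in plain C. The types and the draw call are made up for illustration; the real code goes through Gallium's interfaces, but the shape of the idea is the same: queue a frame's worth of macroblock quads, then flush them in as few draws as possible.

    /* Hypothetical sketch of per-frame batching. Nothing is drawn per
     * macroblock; vertices accumulate for the whole frame and then go out
     * in one or two draw calls. */

    #define MAX_VERTS (80 * 45 * 4 * 6)   /* a 720p frame's worth of block quads */

    struct vertex { float x, y, s, t; };

    struct frame_batch {
        struct vertex verts[MAX_VERTS];
        unsigned      num_verts;
    };

    /* Stand-in for the real driver draw entry point. */
    static void draw_vertices(const struct vertex *v, unsigned n)
    {
        (void)v; (void)n;   /* the real code hands these to the pipe driver */
    }

    /* Queue one macroblock's geometry instead of drawing it immediately. */
    static void batch_macroblock(struct frame_batch *b,
                                 const struct vertex *mb_verts, unsigned count)
    {
        for (unsigned i = 0; i < count; ++i)
            b->verts[b->num_verts++] = mb_verts[i];
    }

    /* At end of frame, submit the accumulated vertices all at once. */
    static void batch_flush(struct frame_batch *b)
    {
        draw_vertices(b->verts, b->num_verts);
        b->num_verts = 0;
    }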
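
The buffer rotation works roughly like the sketch below, again with made-up names rather than the real Gallium calls; the point is simply that since a buffer's old contents are never needed, the CPU never has to wait on the GPU before grabbing one to fill for the next frame.

    #include <stddef.h>

    #define POOL_SIZE 4                 /* assumed pool depth, enough to stay ahead of the GPU */

    struct block_buffer {
        void  *data;                    /* CPU-visible mapping while being filled */
        size_t size;
    };

    struct buffer_pool {
        struct block_buffer bufs[POOL_SIZE];
        unsigned            current;    /* slot handed out for the current frame */
    };

    /* Return the buffer to fill for the next frame. No synchronization is
     * needed because the previous contents are never read back. */
    static struct block_buffer *pool_next(struct buffer_pool *pool)
    {
        pool->current = (pool->current + 1) % POOL_SIZE;
        return &pool->bufs[pool->current];
    }

The cost is a few extra frames' worth of buffer memory; the win is that filling buffers never stalls behind the GPU.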
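
The zero-block handling can be sketched the same way. The structure below is hypothetical, but the idea is exactly as described above: clear one block per channel per frame, then point every later zero block's texcoords at it.

    /* Hypothetical sketch of the zero-block path. Only the first zero block
     * of the frame (per channel) is cleared to black in the source texture;
     * every later zero block just samples from that one cleared block. */

    struct zero_block_state {
        int   cleared;                  /* has a zero block been cleared this frame? */
        float s, t;                     /* texcoords of that cleared block */
    };

    static void zero_block_texcoords(struct zero_block_state *zb,
                                     float block_s, float block_t,
                                     float *out_s, float *out_t)
    {
        if (!zb->cleared) {
            /* First zero block this frame: clear it in the source texture
             * (not shown here) and remember where it lives. */
            zb->cleared = 1;
            zb->s = block_s;
            zb->t = block_t;
        }

        /* Every zero block samples the cleared block, so it contributes
         * nothing to the destination. */
        *out_s = zb->s;
        *out_t = zb->t;
    }

The flow-control idea from the last bullet would go one step further: write texcoords in the negative range for zero blocks and have the fragment shader skip the fetch entirely, saving the GPU texel fetches as well as the CPU clears.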

Having said all that, however, 720p24 decoding is still not done in real time. It's kind of a mystery actually, because while the profiler seems to indicate that we are GPU limited rather than CPU limited, the numbers don't seem to add up. A 1280x720 video is composed of 80x45 macroblocks. Each macroblock is composed of 4 blocks, and each block is rendered as two triangles, so that's 8 triangles per macroblock, or ~29K triangles per frame. At 24 fps that's ~696K tris/sec, or ~2M vertices/sec. Nvidia quotes a GeForce 6200's vertex processing rate at 225M vertices/sec. Our vertex shaders are very simple: we use screen-aligned tris in normalized coords, so we don't have to do any significant transforming, just move inputs to outputs.

Similarly, a 1280x720 video is composed of ~922K pixels, so at 24 fps we're rendering ~22M pixels/sec. In the worst case, each pixel requires 5 texel fetches (three 2-byte fetches and two 4-byte fetches) and one 4-byte write to the frame buffer, which brings us to 308M bytes/sec read and 88M bytes/sec write. The color conversion pass adds another 352M bytes/sec of reads and 88M bytes/sec of writes. Nvidia quotes a 6200's fill rate as 1.2-1.4B texels/sec, and assuming those texels are 32-bit, that works out to 4.8-5.6B bytes/sec. Again, our pixel shaders are not really complicated: mostly TEX2Ds, MULs, and ADDs. Omitting the tex fetching doesn't change much, and neither does disabling color writes to the frame buffer. Regardless of how Nvidia calculates its marketing numbers, we seem to be well below them, so it probably doesn't matter how optimistic they are.
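
For anyone who wants to check the arithmetic, the estimates above boil down to a few lines of C. This just re-derives the figures quoted in the text (the color-conversion pass isn't computed here, since its per-pixel fetch mix isn't broken down above); it isn't a measurement.

    #include <stdio.h>

    int main(void)
    {
        const long width = 1280, height = 720, fps = 24;

        /* Geometry: 16x16 macroblocks, 4 blocks each, 2 tris per block. */
        const long mbs      = (width / 16) * (height / 16);  /* 80 x 45 */
        const long tris     = mbs * 4 * 2;                    /* ~29K per frame */
        const long tris_sec = tris * fps;

        /* Motion comp pass, worst case per pixel: three 2-byte and two
         * 4-byte texel fetches, plus one 4-byte frame buffer write. */
        const long pixels_sec = width * height * fps;
        const long mc_read    = pixels_sec * (3 * 2 + 2 * 4); /* 14 bytes read per pixel */
        const long mc_write   = pixels_sec * 4;               /* 4 bytes written per pixel */

        printf("%ld tris/sec, %ld verts/sec\n", tris_sec, tris_sec * 3);
        printf("MC pass: %ld bytes/sec read, %ld bytes/sec write\n",
               mc_read, mc_write);
        return 0;
    }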

All in all it seems very odd that, given the above, an 854x480 clip renders in real time, but the same clip at 1280x720 takes 4x longer, despite only being 2.25x larger. I suspect that either there is a very non-obvious bug in the state tracker, or that we are doing something odd in the driver: possibly in the way we set up the 3D state, submit commands and data, manage memory, or get our frame buffer onto the X window.

Either way, I hope to continue working on this now that GSoC is over, and anyone who is interested in contributing is free to do so. I'm hoping to move things over to Mesa's GIT sooner or later, and I'm curious to see how it does on Intel's hardware. I don't know how well the current rendering process fits with what Intel supports, but if single-component signed 16-bit textures aren't a problem, it should be very easy to get things up and running. At most, it should only need some minor changes in the winsys layer.