Sunday, January 18, 2009

Yes I'm still decoding video using shaders

It's been a while since I've said much about my video decoding efforts, but there are two pieces of good news to share. Both are improvements to Nouveau in general, not specific to video decoding.

First, we can now load 1080p clips. Thanks to a very small addition to Gallium and a few lines of code in the Nouveau winsys, a lot of brittle code was removed from the state tracker, and memory allocations for incoming data are now dynamic and only done as necessary. The basic situation is this: we allocate a frame-sized buffer, map it, fill it, unmap it, and use it. On the next frame we map it again, fill it again, and so on. But what if the GPU is still processing the first frame? The second time we attempt to map it, the driver has to stall and wait until the GPU is done before it can let us overwrite the contents of the buffer.

But do we have to wait? Not really. We don't need the previous contents of the buffer, since we're going to overwrite the whole thing anyway, so all we really need is a buffer that we can map immediately. To get around this we used to allocate N buffers at startup and rotate between them (filling buffer 0, then 1, and so on), which reduced the likelihood of hitting a busy buffer. The problem with that is obvious: for high-res video we need a ton of extra space, most of it sitting unused most of the time. Now, if we try to map a busy buffer, the driver will allocate a new buffer under the covers if possible and point ours at it, deleting the old one when the GPU is done with it. If the GPU is fast enough and finishes with buffers before we attempt to map them again, everything is good and we keep the minimum number of buffers at any given time. If not, we get new buffers as necessary, in the worst case until we run out of memory, at which point mapping stalls again. The best of both worlds.
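
To make the idea concrete, here is a minimal sketch in plain C of that "rename on busy map" strategy. The structure and helper names (vid_buffer, gpu_is_busy, defer_free_until_idle, and so on) are hypothetical stand-ins, not the actual Gallium or Nouveau winsys API; the sketch only illustrates the logic described above.

```c
#include <stdlib.h>
#include <stdbool.h>

/* Hypothetical per-buffer bookkeeping; not the real winsys structures. */
struct vid_buffer {
    void  *storage;  /* backing memory the GPU reads from */
    size_t size;     /* frame-sized allocation            */
};

/* Assumed helpers, standing in for real fence/winsys calls. */
bool gpu_is_busy(const struct vid_buffer *buf);
void wait_for_gpu_idle(struct vid_buffer *buf);
void defer_free_until_idle(void *storage);

/*
 * Map a buffer that the caller intends to overwrite completely.  If the
 * GPU is still using the old contents, swap in fresh backing storage
 * instead of stalling; the old storage is freed once the GPU retires it.
 */
static void *map_for_overwrite(struct vid_buffer *buf)
{
    if (gpu_is_busy(buf)) {
        void *fresh = malloc(buf->size);
        if (fresh) {
            /* Let the GPU keep the old storage until it is done with it,
             * and hand the caller a brand-new allocation to fill. */
            defer_free_until_idle(buf->storage);
            buf->storage = fresh;
        } else {
            /* Out of memory: fall back to the old behaviour and stall. */
            wait_for_gpu_idle(buf);
        }
    }
    return buf->storage;
}
```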

The second bit of good news is that we've managed to figure out how to use swizzled surfaces, which gave a very large performance boost. Up to now we've been using linear surfaces everywhere, which are not very cache or prefetch friendly. Rendering to swizzled surfaces during the motion compensation stage lets my modest Athlon XP 1.5 GHz + GeForce 6200 machine handle 720p with plenty of CPU to spare. 1080p still bogs the GPU down, but the reason for that is pretty clear: we still render to a linear back buffer and copy to a linear front buffer. We can't swizzle our back or front buffers, so the next step will be to figure out how to get tiled surfaces working, which are similar, but can be used for back and front buffers. Hopefully soon we can tile the X front buffer and DRI back buffers and get a good speed boost everywhere, but because of the way tiled surfaces seem to work (on NV40 at least) I suspect it will require a complete memory manager to do it neatly.
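
For the curious, swizzled layouts of this kind are essentially a Morton (Z-order) arrangement: the bits of a texel's X and Y coordinates are interleaved to form its offset, so small 2D neighbourhoods end up close together in memory, which is what makes them cache and prefetch friendly. Here is a rough sketch of that address calculation; it illustrates the idea only and is not the exact NV40 swizzle formula.

```c
#include <stdint.h>

/*
 * Interleave the low 16 bits of x and y into a Morton (Z-order) index:
 * bit i of x goes to bit 2i, bit i of y goes to bit 2i+1.  With a layout
 * along these lines, a texel and its neighbours above and below land on
 * nearby cache lines, unlike a linear layout where vertical neighbours
 * are a full pitch apart.
 */
static uint32_t morton_offset(uint32_t x, uint32_t y)
{
    uint32_t offset = 0;
    for (unsigned i = 0; i < 16; i++) {
        offset |= ((x >> i) & 1u) << (2 * i);
        offset |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return offset; /* multiply by bytes-per-pixel for a byte address */
}
```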

Beyond that there are still a few big optimizations that we can implement for video decoding (conditional tex fetching, optimized block copying, smarter vertex pos/texcoord generation, etc), but the big boost we got from swizzling gives me a lot of optimism that using shaders for at least part of the decoding process can be a big win. It probably won't beat dedicated hardware, but for formats not supported by hardware, or for decoding more than one stream at a time, we can probably do a lot of neat things in time.

I've also been looking at VDPAU, which seems like a nice API but will require a lot of work to support on cards that don't have dedicated hardware. More on that later maybe.

Sunday, September 14, 2008

GSoC is over, how did Generic GPU-Accelerated Video Decoding do?

So GSoC has come to a close, and this project was successful, in that there is a working XvMC implementation sitting in nouveau/mesa's gallium-0.1 branch. Currently the NV40 Gallium driver is the only one complete enough to run XvMC, and there are still a few missing features (no support for interlaced video, subpictures aren't implemented yet, only motion compensation is currently accelerated).

In my last entry I mentioned that I was hoping to spend the last part of GSoC getting IDCT working, but I came to realize that this would probably require more work than I initially estimated, due to the limited render target formats GPUs support. We decided that we may also want to take advantage of fixed function IDCT hardware if it is available, and one of the other Nouveau contributors had been looking into this on NV40, so I'm hoping we can take advantage of his efforts and get that into the Gallium NV40 driver in some fashion. Instead I spent the last two weeks of GSoC and the first two weeks of the rest of my life focusing on performance and cleaning up a few bugs here and there.

As far as performance goes, we managed to grab most of the low hanging fruit.

  • We buffer an entire frame of content and fire that off with a few draw calls. Most frames, depending on their content, can be done in two draw calls.

  • Because we have to fill buffers with new content each frame, we don't necessarily want to wait until the GPU is done with those buffers before we map and update them. Since we don't need their old contents we can just allocate a set of buffers and rotate them, double buffer style.

  • For P and B frames many blocks are composed entirely of pixels from the reference frame(s), so we don't technically need to upload any new data.

    Previously we would clear that block of the source texture to black, so that it didn't contribute anything to the destination block. However, for most P and B frames a significant number of blocks fall into this category, and most frames are P or B frames, so that's a lot of useless clearing on the CPU side and texel fetching on the GPU side.

    To get around this we clear the first such zero block of each frame for the luma and two chroma channels, and for subsequent zero blocks we texture from that first block (there's a short sketch of this after the list). This saves a nice chunk of CPU time, but doesn't do much for GPU texture bandwidth.

    Once I figure out how TGSI expresses flow control constructs, I'm hoping we can just set the texcoords for zero blocks to the negative range and make the texture fetch conditional, but for older hardware that doesn't support conditional execution the current path should be good.
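
As a rough illustration of the zero-block handling above, here is a sketch in C of the per-block upload decision. The structures and names are hypothetical and not the actual state tracker code; the point is only that empty blocks end up texturing from one shared cleared block instead of each being cleared individually.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_W 8
#define BLOCK_H 8

/* Hypothetical description of one 8x8 block's slot in the source texture. */
struct block_slot {
    unsigned x, y;           /* texel coords of the block's top-left corner */
};

/* Hypothetical per-frame upload state for one plane (luma or chroma). */
struct frame_upload {
    int16_t *texels;               /* mapped source texture                */
    unsigned stride;               /* texels per row in the source texture */
    struct block_slot shared_zero; /* the single block cleared to zero     */
    bool have_zero;                /* has the shared zero block been set?  */
};

/*
 * Handle one block: empty blocks (no coefficients) reuse the shared cleared
 * block, non-empty blocks get their coefficients copied into their own slot.
 * The returned slot is what the block's texcoords should point at.
 * (With shader flow control we could instead tag empty blocks with negative
 * texcoords and skip the fetch entirely.)
 */
static struct block_slot upload_block(struct frame_upload *f,
                                      struct block_slot slot,
                                      const int16_t *coeffs) /* NULL if empty */
{
    int16_t *dst = f->texels + slot.y * f->stride + slot.x;

    if (!coeffs) {
        if (!f->have_zero) {
            /* Clear exactly one block per frame and remember where it is. */
            for (unsigned row = 0; row < BLOCK_H; row++)
                memset(dst + row * f->stride, 0, BLOCK_W * sizeof *dst);
            f->shared_zero = slot;
            f->have_zero = true;
        }
        return f->shared_zero;  /* later empty blocks texture from here */
    }

    for (unsigned row = 0; row < BLOCK_H; row++)
        memcpy(dst + row * f->stride, coeffs + row * BLOCK_W,
               BLOCK_W * sizeof *dst);
    return slot;
}
```
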
Having said all that, however, 720p24 decoding still doesn't happen in real time. It's something of a mystery, actually, because while the profiler seems to indicate that we are GPU limited rather than CPU limited, the numbers don't add up. A 1280x720 video is composed of 80x45 macroblocks. Each macroblock is composed of 4 blocks, and each block is rendered as two triangles, so that's 8 triangles per macroblock, or ~29K triangles per frame. At 24 fps that's ~696K tris/sec, or ~2M vertices/sec. Nvidia quotes the GeForce 6200's vertex processing rate at 225M vertices/sec. Our vertex shaders are very simple; we use screen-aligned tris in normalized coords, so we don't have to do any significant transformation, just move inputs to outputs.

Similarly, a 1280x720 video is composed of ~922K pixels. At 24 fps we're rendering ~22M pixels/sec. In the worst case each pixel requires 5 texel fetches (3 2-byte fetches and 2 4-byte fetches) and one 4-byte write to the frame buffer, which brings us to ~308M bytes/sec read and ~88M bytes/sec write. The color conversion pass adds another ~352M bytes/sec of reads and ~88M bytes/sec of writes. Nvidia quotes the 6200's fillrate as 1.2-1.4B texels/sec, and assuming those texels are 32-bit, that works out to 4.8-5.6B bytes/sec. Again, our pixel shaders are not really complicated, mostly TEX2Ds, MULs, and ADDs. Omitting the tex fetching doesn't change much, and neither does disabling color writes to the frame buffer. Regardless of how Nvidia calculates its marketing numbers, we seem to be well below them, so it probably doesn't matter how optimistic they are.
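
For anyone who wants to check the arithmetic in the last two paragraphs, here it is written out as a small C program. The per-pixel fetch sizes are the assumptions stated above, not measurements, and the exact products round slightly differently from the approximate figures quoted in the text.

```c
#include <stdio.h>

int main(void)
{
    const double fps = 24.0;

    /* Geometry: 1280x720 is 80x45 macroblocks, 4 blocks per macroblock,
     * 2 triangles per block. */
    double mbs     = (1280.0 / 16.0) * (720.0 / 16.0); /* 3600 per frame   */
    double tris    = mbs * 4.0 * 2.0;                  /* 28,800 per frame */
    double tris_s  = tris * fps;                       /* ~690K tris/sec   */
    double verts_s = tris_s * 3.0;                     /* ~2M verts/sec    */

    /* Fill: worst case 3 two-byte + 2 four-byte fetches and one 4-byte
     * write per pixel for MC; color conversion assumed to read 16 bytes
     * and write 4 bytes per pixel, matching the figures above. */
    double pix_s     = 1280.0 * 720.0 * fps;           /* ~22M pixels/sec  */
    double mc_read   = pix_s * (3 * 2 + 2 * 4);        /* ~310M bytes/sec  */
    double mc_write  = pix_s * 4;                      /* ~88M bytes/sec   */
    double csc_read  = pix_s * 16;                     /* ~354M bytes/sec  */
    double csc_write = pix_s * 4;                      /* ~88M bytes/sec   */

    printf("%.0f tris/sec, %.0f verts/sec\n", tris_s, verts_s);
    printf("MC:  read %.1fM bytes/sec, write %.1fM bytes/sec\n",
           mc_read / 1e6, mc_write / 1e6);
    printf("CSC: read %.1fM bytes/sec, write %.1fM bytes/sec\n",
           csc_read / 1e6, csc_write / 1e6);
    return 0;
}
```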

All in all it seems very odd that, given the above, an 854x480 clip renders in real time but the same clip at 1280x720 takes 4x longer, despite only being 2.25x larger. I suspect that either there is a very non-obvious bug in the state tracker, or we are doing something odd in the driver: possibly in the way we set up the 3D state, submit commands and data, or manage memory, or possibly in how we get our frame buffer onto the X window.

Either way, I hope to continue working on this now that GSoC is over, and anyone who is interested in contributing is welcome to do so. I'm hoping to move things over to Mesa's git sooner or later, and I'm curious to see how it does on Intel's hardware. I don't know how well the current rendering process fits with what Intel supports, but if single-component signed 16-bit textures aren't a problem it should be very easy to get things up and running. At best all it will need is some minor changes in the winsys layer.

Saturday, August 2, 2008

IDCT vs. the GPU

I've come to understand a few things while talking to Stephane (marcheu) and trying to come up with a (hopefully fast) way of performing IDCT on a typical GPU: 1) It doesn't fit nearly as well as motion compensation does, and 2) it wouldn't necessarily take a radical departure from current designs to make it fit, just a few adjustments here and there. I think the second point is the more frustrating of the two.

The problems so far are as follows:

  1. The input format, signed 12-bit integers, isn't amongst your GPU's favourite texture formats. We're pretty fortunate that signed 16-bit integers are available at least, even though we have to renormalize to 12 bits. Unfortunately signed 16-bit integers are usually not available with four components, which leads to...

  2. Hefty texel fetch requirements. If we had signed 16-bit RGBA textures we could do some packing and cut the number of fetches by a factor of 4, but we don't. Therefore, a naive 2D IDCT would require 128 texel fetches and 64 MADDs per pixel (there's a sketch of the per-pixel arithmetic after this list). Even if we could pack our DCT coefficients into 4 components we would still be left with 32 texel fetches, 16 DP4s, and 15 additions. There are algorithms such as AAN that trade multiplications for additions, but they are not suited to being calculated a pixel at a time or to taking advantage of our vector arithmetic units. Instead of a 2D IDCT we could opt for two 1D IDCTs, which would give us 16 texel fetches and 8 MADDs per pixel per pass, or 4 texel fetches, 2 DP4s, and 1 addition if we could do full packing. However, we really can't do 1D IDCTs efficiently because...

  3. We can't render to signed 16-bit buffers of any sort, so we have to find something to do with the intermediate result. The only alternative at the moment would be to render to floating point buffers, but then we lose a whole group of GPUs that either can't render to FP buffers or can but do so very slowly. And even if we manage to do a respectable IDCT, we have one more issue to deal with...

  4. The output of our IDCT has to be easily consumable during our motion compensation pass. Currently motion compensation is a very good fit for the GPU model, however if we have to jump through hoops to fit IDCT into the picture we don't want to make the output too cumbersome to use during motion compensation, which from an acceleration POV is the more important stage to offload.
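
As a reference for the counts in point 2, here is the textbook per-sample 8x8 2D IDCT written out in plain C. A fragment shader computing this directly needs one fetch per coefficient (and one per basis value, if the cosine products are also stored in a texture) plus one MADD per term, which is where the 128-fetch / 64-MADD figure comes from. This is only a sketch of the math, not code from the project.

```c
#include <math.h>

/*
 * Naive 8x8 2D IDCT evaluated one output sample at a time, the way a
 * per-pixel shader would have to do it: 64 multiply-adds per sample,
 * each needing the coefficient F(u,v) and the corresponding basis value.
 */
static float idct_sample(const float coeff[8][8] /* F(u,v), row = v */,
                         int x, int y)
{
    float sum = 0.0f;
    for (int v = 0; v < 8; v++) {
        for (int u = 0; u < 8; u++) {
            float cu = (u == 0) ? (float)M_SQRT1_2 : 1.0f;
            float cv = (v == 0) ? (float)M_SQRT1_2 : 1.0f;
            sum += cu * cv * coeff[v][u] *
                   cosf((2 * x + 1) * u * (float)M_PI / 16.0f) *
                   cosf((2 * y + 1) * v * (float)M_PI / 16.0f);
        }
    }
    return 0.25f * sum;  /* f(x, y), still to be clamped to the valid range */
}
```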

Having said all of that, what we would need to do a fast IDCT isn't all that much.

  • I think simply having support for signed 16-bit RGBA textures would go a long way toward making IDCT fit. We do have support for signed 16-bit two-component textures, so at least we're halfway there.

  • Signed 16-bit render targets would also be helpful, although going forward FP16 and FP32 support are probably better to target.

  • The dream of every programmer who has ever used SIMD instructions, the horizontal addition, would also be very useful.

  • Being able to swizzle and mask components as part of a texel fetch would also help, since we receive planar data and have to pack it at some point.

However, we don't have any of that, as far as I know. What we do have is an MC implementation that currently fits very well and still has room for improvement, so at least that's a bright spot. I also have some promising ideas to tackle IDCT and its issues, and 3-4 weeks to figure it out.

Monday, July 21, 2008

Up and running on real hardware

I reached a nice milestone today: working playback on my GeForce 6200. Most of the work went into the winsys layer, with some bug fixes and workarounds in other places, but everything is up and running now. Unfortunately the output isn't perfect; there is some slight corruption here and there. I'm guessing it has to do with some dodgy assumptions I made about shader arithmetic (rounding, saturation, etc.) that SoftPipe went along with but the GPU didn't. The other issue is that there is some severe slowdown when any 2D drawing happens on the rest of the desktop. I'm guessing this may be due to locking when copying the backbuffer to the window, or maybe I'm completely soaking up the CPU.

Currently nothing is optimized; I'm not even turning on compiler optimizations, and I have a really slow prototype IDCT implementation running on the CPU in place of the hardware version, so I'm sure I'm eating up a lot more CPU time than I will be by the end of the summer. I have a lot of different ideas for optimizations targeting CPU usage and GPU fillrate, but given that I already get almost full-speed playback, I'm pretty confident that I'll be able to get HD playback by the end of SoC.

As far as the winsys goes, I was able to use most of the current Nouveau winsys. Unfortunately the DRI stuff is buried within Mesa right now, so I had to extract a lot of things and create a standalone library to handle screens, drawables, the SAREA, etc., in order to use DRI without including and linking against half of Mesa. The winsys interface is also simpler than Mesa's; there are only a few client calls, the backbuffer is handled in the state tracker, and the winsys doesn't have to create or call into the state tracker. It took me a while to realize why the Mesa winsys was set up the way it was, and that I could simplify things on my end.

Here are some screen grabs:

Coffee mug containing two pens and a feather. Woman on the phone. Windmill in the middle of a field of yellow flowers.
Two cartoon characters (desktop visible). Fighter jet flying through a clear sky (desktop visible).

Sunday, June 29, 2008

One hurdle down, many more to go

Just a quick update: after some re-reading of the MPEG-2 spec, debugging, and cleanup, I've finally got correct output from the MC stage for progressive video clips that use frame-based and field-based motion prediction. There are two other motion prediction methods, 16x8 and dual-prime, but they don't seem to be too common and shouldn't be too hard to implement anyway. It took a bit of tweaking, but comparing the output to that of other media players I see no difference, which means one hurdle down. Next steps are to revisit IDCT and start working with real hardware.

Here are some screen grabs from various test clips:

Construction site on a field.
Windmill in the middle of a field of yellow flowers.
Coffee mug containing two pens and a feather.
Woman on the phone.

Thursday, June 26, 2008

Progress

I put some work into getting field-based prediction working, and I think I have it mostly right. I ran into what I think is a bug in SoftPipe, which has to do with locking and updating textures: for some reason the surface and texture cache does not get invalidated in such cases, leading to stale texels being read and displayed. I manually flush the texture cache after mapping textures, and that seems to take care of it. It took a lot of debugging to track that one down, and it's probably fixed upstream, but at least it's another issue out of the way. At the moment some macroblocks are still not rendered correctly, but I'm hoping to sort those out soon.

The one thing I really can't stand is writing shader code for Gallium. The amount of C code you need to write to generate a token stream for even a simple shader is obscene. Currently I have 12 shaders, and each is about 200-300 lines of code for 10-15 shader instructions, so most of that code is noise. On more than one occasion I've made changes to the wrong shader just because it's so hard to wade through the code. What I wouldn't do for a simple TGSI assembler right about now. I'll have to do something about that; it's a huge eyesore.

It's not surprising that I'm a little behind on my schedule. I started on IDCT a while back but put that code down to focus on MC. Luckily IDCT isn't strictly necessary, since XvMC allows for MC-only acceleration, so I can test things and move forward on MC without having to worry about IDCT. I'm hoping the next step of the project, getting things running on real hardware, will be as painless as possible, leaving me time to get IDCT working. However, considering all the little unforeseen issues that have cropped up with SoftPipe, I wouldn't be surprised if I ran into more of the same with the Nouveau driver.

Monday, June 9, 2008

Moving along

Things are moving along in the right direction. I finally got a chance to push my work to date to Nouveau's mesa git; you can check it out here. I have I, P, and B macroblocks working correctly when rendering frame pictures and using frame-based motion compensation. All that's left to implement is field-based motion compensation (which is surprisingly common, even in progressive content) and rendering of field-based pictures (i.e. interlaced content). I think I've figured out a way to efficiently render macroblocks that use field-based prediction in one pass. Frame-based prediction works by grabbing a macroblock from a previously rendered surface and adding a difference to form the new macroblock. Field-based prediction works the same way, but references two macroblocks on the previously rendered surface, one for even scanlines and the other for odd. My plan is to read from both reference macroblocks on every scanline and choose which one to keep based on whether the scanline is even or odd, which can easily be done with a lerp(). It would be preferable to avoid the unnecessary texture read, but it's simple and works in a single pass. Other alternatives include rendering the macroblock twice (once with only even scanlines, then with odd, using texkill to discard the alternating scanlines) and rendering even and odd scanlines using line lists (which I understand makes sub-optimal use of various caches in the pixel pipeline).
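
To make the single-pass idea concrete, here is a tiny C model of the per-pixel selection. It is just a sketch of the arithmetic a fragment shader would perform, with made-up names; it is not shader code from the project.

```c
/*
 * Per-pixel model of single-pass field-based prediction: sample both
 * reference macroblocks, keep the even-field sample on even scanlines and
 * the odd-field sample on odd ones, then add the decoded difference.
 * The selection is written as a lerp so it maps directly onto an
 * LRP-style shader instruction.
 */
static float predict_field_pixel(int scanline,
                                 float even_ref, /* texel from even-field reference */
                                 float odd_ref,  /* texel from odd-field reference  */
                                 float diff)     /* decoded difference for this pixel */
{
    float f = (float)(scanline & 1);                 /* 0 on even lines, 1 on odd */
    float ref = even_ref + f * (odd_ref - even_ref); /* lerp(even_ref, odd_ref, f) */
    return ref + diff;
}
```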