Saturday, August 2, 2008

IDCT vs. the GPU

I've come to understand a few things while talking to Stephane (marcheu) and trying to come up with a (hopefully fast) way of performing IDCT on a typical GPU: 1) It doesn't fit nearly as well as motion compensation does, and 2) it wouldn't necessarily take a radical departure from current designs to make it fit, just a few adjustments here and there. I think the second point is the more frustrating of the two.

The problems so far are as follows:

  1. The input format, signed 12-bit integers, isn't amongst your GPU's favourite texture formats. We're pretty fortunate that signed 16-bit integers are available at least, even though we have to renormalize to 12 bits. Unfortunately signed 16-bit integers are usually not available with four components, which leads to...

  2. Hefty texel fetch requirements. If we had signed 16-bit RGBA textures we could do some packing and cut the number of fetches by a factor of 4, but we don't. Therefore, a naive 2D IDCT would require 128 texel fetches and 64 MADDs per pixel. Even if we had the ability to pack our DCT coefficients into 4 components we would still be left with 32 texel fetches, 16 DP4s, and 15 additions. There are algorithms such as AAN, which trade multiplications for additions, but they are not suited to being calculated a pixel at a time or to taking advantage of our vector arithmetic units. Instead of a 2D IDCT we could opt for two 1D IDCTs, which would give us 16 texel fetches and 8 MADDs per pixel per pass, or 4 texel fetches, 2 DP4s, and 1 addition if we could do full packing. However, we really can't do 1D IDCTs efficiently because...

  3. We can't render to signed 16-bit buffers of any sort, so we have to find something to do with the intermediate result. The only alternative at the moment would be to render to floating point buffers, but then we lose a whole group of GPUs that either can't render to FP buffers or can only do so very slowly. But even if we manage to do a respectable IDCT, we have one more issue to deal with...

  4. The output of our IDCT has to be easily consumable during our motion compensation pass. Currently motion compensation is a very good fit for the GPU model; however, if we have to jump through hoops to fit IDCT into the picture, we don't want to make the output too cumbersome to use during motion compensation, which from an acceleration POV is the more important stage to offload.
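
To make the fetch and MADD counts in point 2 a bit more concrete, here is a rough CPU-side sketch in Python of what a per-pixel IDCT has to do. It is not the shader code; it assumes (my assumption, for illustration) that each basis value would come from a precomputed lookup texture, so every term costs one coefficient fetch plus one basis fetch. The function names are made up for this example.

```python
import math

N = 8  # MPEG-2 blocks are 8x8


def c(k):
    # DCT normalization factor: 1/sqrt(2) for the DC term, 1 otherwise
    return math.sqrt(0.5) if k == 0 else 1.0


def basis(x, u):
    # 1D IDCT basis value for output sample x and frequency u
    return 0.5 * c(u) * math.cos((2 * x + 1) * u * math.pi / 16.0)


def idct_2d_pixel(coeffs, x, y):
    """Naive 2D IDCT for one output pixel; returns (value, fetches, madds)."""
    acc, fetches, madds = 0.0, 0, 0
    for v in range(N):
        for u in range(N):
            coeff = coeffs[v][u]                # fetch 1: DCT coefficient
            weight = basis(x, u) * basis(y, v)  # fetch 2: assumed basis lookup texel
            acc += coeff * weight               # one MADD
            fetches += 2
            madds += 1
    return acc, fetches, madds


def idct_1d_pixel(row, x):
    """One 1D IDCT output sample; returns (value, fetches, madds)."""
    acc, fetches, madds = 0.0, 0, 0
    for u in range(N):
        acc += row[u] * basis(x, u)  # coefficient fetch + basis fetch, one MADD
        fetches += 2
        madds += 1
    return acc, fetches, madds


def idct_2d_separable(coeffs):
    """Two 1D passes: rows first, then columns of the intermediate result."""
    tmp = [[idct_1d_pixel(coeffs[v], x)[0] for x in range(N)] for v in range(N)]
    return [[idct_1d_pixel([tmp[v][x] for v in range(N)], y)[0]
             for x in range(N)] for y in range(N)]
```

The `tmp` array in `idct_2d_separable` is exactly the intermediate result that point 3 complains about: on the GPU it would have to live in a render target between the two passes.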

Having said all of that, what we need to be able to do a fast IDCT isn't so much.

  • I think simply having support for signed 16-bit RGBA textures would go a long way to making IDCT fit. We do have support for signed 16-bit two component textures, so at least we're half way there.

  • Signed 16-bit render targets would also be helpful, although going forward FP16 and FP32 support are probably better to target.

  • The dream of every programmer who has ever used SIMD instructions, the horizontal addition, would also be very useful.

  • Being able to swizzle and mask components as part of a texel fetch would also help, since we receive planar data and have to pack it at some point.
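
For illustration, this is roughly what a fully packed 1D IDCT sample would look like if four-component signed textures were available. It is a Python sketch under that assumption, with `dp4` standing in for the GPU's four-component dot product; the packing layout and names are hypothetical.

```python
import math


def dp4(a, b):
    # Stand-in for the GPU's 4-component dot product instruction
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def basis(x, u):
    # 1D IDCT basis value for output sample x and frequency u
    scale = math.sqrt(0.5) if u == 0 else 1.0
    return 0.5 * scale * math.cos((2 * x + 1) * u * math.pi / 16.0)


def idct_1d_packed(row, x):
    """One 1D IDCT sample using 4 packed fetches, 2 DP4s, and 1 addition."""
    coeff_lo = tuple(row[0:4])                           # fetch 1: coefficients 0..3
    coeff_hi = tuple(row[4:8])                           # fetch 2: coefficients 4..7
    basis_lo = tuple(basis(x, u) for u in range(0, 4))   # fetch 3: basis values 0..3
    basis_hi = tuple(basis(x, u) for u in range(4, 8))   # fetch 4: basis values 4..7
    return dp4(coeff_lo, basis_lo) + dp4(coeff_hi, basis_hi)
```

The final addition is where a horizontal add would help: with it, the partial sums would not need a DP4 per group of four.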

However, we don't have any of that, as far as I know. What we do have is an MC implementation that currently fits very well and still has room for improvement, so at least that's a bright spot. I also have some promising ideas to tackle IDCT and its issues, and 3-4 weeks to figure it out.

Monday, July 21, 2008

Up and running on real hardware

I reached a nice milestone today: working playback on my Geforce 6200. Most of the work went into the winsys layer, with some bug fixes and workarounds in other places, but everything is up and running now. Unfortunately the output isn't perfect; there is some slight corruption here and there. I'm guessing it has to do with some dodgy assumptions I made about shader arithmetic (rounding, saturation, etc.) that SoftPipe went along with but the GPU didn't. The other issue is that there is some severe slowdown when any 2D drawing happens on the rest of the desktop. I'm guessing this may be due to locking when copying the backbuffer to the window, or maybe I'm completely soaking up the CPU.

Currently nothing is optimized, I'm not even turning on compiler optimization, and I have a really slow prototype IDCT implementation performed on the CPU in place of the hardware version, so I'm sure I'm eating up a lot more CPU time than I will be by the end of the summer. I have a lot of different ideas on optimization that will target CPU usage and GPU fillrate usage, but given that I get almost full speed playback currently, I'm pretty confident that I'll be able to get HD playback by the end of SoC.

As far as the winsys goes, I was able to use most of the current Nouveau winsys. Unfortunately the DRI stuff is buried within Mesa right now, so I had to extract a lot of things and create a standalone library to handle screens, drawables, the SAREA, etc. to be able to use DRI without including and linking with half of Mesa. The winsys interface is also simpler than Mesa's; there are only a few client calls, the backbuffer is handled in the state tracker, and the winsys doesn't have to create or call into the state tracker. It took me a while to realize why the Mesa winsys was set up the way it was, and that I could simplify things on my end.

Here are some screen grabs:

Coffee mug containing two pens and a feather. Woman on the phone. Windmill in the middle of a field of yellow flowers.
Two cartoon characters (desktop visible). Fighter jet flying through a clear sky (desktop visible).

Sunday, June 29, 2008

One hurdle down, many more to go

Just a quick update; after some re-reading of the MPEG2 spec, debugging, and clean up I've finally got correct output for the MC stage for progressive video clips that use frame based and field based motion prediction. There are two other motion prediction methods, 16x8 and dual-prime, but they don't seem to be too common and shouldn't be too hard to implement anyway. It took a bit of tweaking, but comparing the output to that of other media players I see no difference, which means one hurdle down. Next steps are to revisit IDCT and start working with real hardware.

Here are some screen grabs from various test clips:

Construction site on a field.
Windmill in the middle of a field of yellow flowers.
Coffee mug containing two pens and a feather.
Woman on the phone.

Thursday, June 26, 2008

Progress

I put some work into getting field-based prediction working, and I think I have it mostly right. I ran into what I think is a bug in SoftPipe, which has to do with locking and updating textures. For some reason the surface and texture cache does not get invalidated in such cases, leading to stale texels being read and displayed. I manually flush the texture cache after mapping textures, and that seems to take care of it. It took a lot of debugging to track that one down and is probably fixed upstream, but at least it's another issue out of the way. At the moment some macroblocks are still not rendered correctly, but I'm hoping to get those out of the way.

The one thing I really can't stand is writing shader code for Gallium. The amount of C code you need to write to generate a token stream for even a simple shader is obscene. Currently I have 12 shaders and each is about 200-300 lines of code for 10-15 shader instructions, so most of that code is noise. On more than one occasion I've made changes to the wrong shader just because it's so hard to wade through the code. What I wouldn't do for a simple TGSI assembler right about now. I'll have to do something about that; it's a huge eyesore.

It's not surprising that I'm a little behind on my schedule. I started on IDCT a while back but put that code down to focus on MC. Luckily IDCT isn't strictly necessary as XvMC allows for MC-only acceleration, so I can test things and move forward on MC without having to worry about IDCT. I'm hoping the next step of the project, getting things running on real hardware, will be as painless as possible, allowing me to get IDCT working. However, considering all the little unforeseen issues that have cropped up with SoftPipe, I wouldn't be surprised if I ran into more of the same with the Nouveau driver.

Monday, June 9, 2008

Moving along

Things are moving along in the right direction. I finally got a chance to push my work to date to Nouveau's mesa git, you can check it out here. I have I, P, and B macroblocks working correctly when rendering frame pictures and using frame-based motion compensation. All that's left to implement is field-based motion compensation (which is surprisingly common, even in progressive content), and rendering field-based pictures (i.e. interlaced content). I think I've figured out a way to efficiently render macroblocks that use field-based prediction in one pass. Frame-based prediction works by grabbing a macroblock from a previously rendered surface and adding a difference to form the new macroblock. Field-based prediction works the same way, but references two macroblocks on the previously rendered surface, one for even scanlines and the other for odd. My plan is to read from both reference macroblocks every scanline and choose which one to keep based on whether the scanline is even or odd. This can easily be done with a lerp(). It would be preferable to avoid the unnecessary texture read, but it's simple and works in a single pass. Other alternatives include rendering the macroblock twice (once with even scanlines only, then with odd scanlines, using texkill to discard alternating scanlines), and rendering even and odd scanlines using line lists (which I understand makes sub-optimal usage of various caches in the pixel pipeline).
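
The lerp() selection can be sketched on the CPU like this. It's a simplified Python model, not the actual shader: `ref_even` and `ref_odd` are assumed to hold what the two texture reads would return at each output pixel, and the function name and the 0..255 clamp are illustrative choices of mine.

```python
def lerp(a, b, t):
    # Same meaning as the shader instruction: a when t == 0, b when t == 1
    return a + (b - a) * t


def predict_field_macroblock(ref_even, ref_odd, diff):
    """Pick the even or odd reference per scanline, then add the differential."""
    out = []
    for y, diff_row in enumerate(diff):
        t = float(y % 2)  # 0.0 on even scanlines, 1.0 on odd scanlines
        row = []
        for x, d in enumerate(diff_row):
            # Both references are always read; lerp just keeps one of them
            pred = lerp(ref_even[y][x], ref_odd[y][x], t)
            row.append(min(255, max(0, int(pred) + d)))  # saturate to 8 bits
        out.append(row)
    return out
```

Because t is always exactly 0 or 1, the lerp never blends; it acts as a branch-free select, at the cost of the extra texture read mentioned above.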

Monday, May 26, 2008

Something to show

I've put some more work into getting P and B frames rendering correctly and things are proceeding very well. Currently texturing from the reference frame works for P frames. All I have to do is add the differentials, which is a little tricky. The problem is that differentials are 9 bits, which means that in an A8L8 texture we get 8 bits in the L channel and 1 bit in the A channel. This shouldn't be too hard, just a bit of arithmetic in the pixel shader code. A more interesting problem is dealing with field-based surfaces, both when rendering and when using them in motion prediction. There's no straightforward way to render to even/odd scanlines on conventional hardware, so this will require some special attention. Currently I'm thinking I will have to render line lists instead of triangle lists when a macroblock uses field-based motion prediction and for rendering even/odd scanlines.
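
Here is a small Python sketch of one way the 9-bit differentials could be split across the two channels and put back together. The exact encoding is an assumption for illustration (two's complement, low 8 bits in L, the ninth bit stored in A as 0 or 255), and `unpack_diff` mimics a shader by taking normalized [0, 1] channel values.

```python
def pack_diff(d):
    """Split a signed 9-bit differential (-256..255) into (L, A) bytes."""
    if not -256 <= d <= 255:
        raise ValueError("differential does not fit in 9 bits")
    raw = d & 0x1FF              # 9-bit two's complement
    low = raw & 0xFF             # low 8 bits go in the L channel
    high = (raw >> 8) * 255      # ninth bit goes in A as 0 or 255
    return low, high


def unpack_diff(l_norm, a_norm):
    """Rebuild the differential from normalized channels, as a shader would."""
    low = round(l_norm * 255.0)  # undo the [0, 1] normalization
    high = round(a_norm)         # 0 or 1
    return low - 256 * high      # ninth bit carries a weight of -256
```

For example, `pack_diff(-1)` gives `(255, 255)`, and `unpack_diff(255 / 255.0, 255 / 255.0)` turns that back into -1.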

Here are some images from mpeg2play_accel, which I've been using as a test program:

Initial I-frame of the video.

Next frame, only P macroblocks using frame-based motion prediction are currently displayed, the rest are skipped.

Next frame, more macroblocks are rendered, and it looks mostly correct, except for the fine details. This is because the differentials are not taken into account yet.

Next frame, a few more unhandled macroblocks in this one.

Sunday, May 18, 2008

Intra-coded macroblocks? Check

After a few weeks of work I've made some good progress. Basic rendering of intra-coded macroblocks is working. What this means is that if you view a video you'll see the occasional full frame displayed correctly, and some macroblocks from the frames in between displayed correctly. Intra-coded macroblocks are the simplest to deal with, since they don't depend on motion compensation; all the data is present and you just have to render it. Every nth frame of an MPEG2 stream is composed entirely of intra-coded macroblocks. It's these frames that are currently being displayed correctly. Other frames are composed of some intra-coded macroblocks, but mostly inter-coded macroblocks. Inter-coded macroblocks depend on motion compensation and their samples are usually differentials. These I haven't gotten yet.

I've also cleaned things up a bit, added some error checking, and added some more tests. It's taken a lot of stepping through Gallium code to get things right, in lieu of documentation, but thanks to GDB, and even more to Insight, I've gotten this far. Stephane has answered my questions, mostly on how to efficiently do things, and even the folks in #mplayerdev have been helpful on XvMC and general decoding matters, so all in all I would say things are going smoothly.

One thing I'm sure of is that no one reads this thing currently. The X.Org folks have asked that their students keep a blog and also submit it to planet.freedesktop.org but I was told it only accepts RSS feeds. Currently this entire web site is maintained using a text editor, so I'll have to work out something more sophisticated in the near future. :-/ (Update: Since then I've been using BlogSpot and you're probably reading this post there instead of the old page.)