Sunday, September 14, 2008

GSoC is over, how did Generic GPU-Accelerated Video Decoding do?

So GSoC has come to a close, and this project was successful, in that there is a working XvMC implementation sitting in nouveau/mesa's gallium-0.1 branch. Currently the NV40 Gallium driver is the only one complete enough to run XvMC, and there are still a few missing features (no support for interlaced video, subpictures aren't implemented yet, only motion compensation is currently accelerated).

In my last entry I mentioned that I was hoping to spend the last part of GSoC getting IDCT working, but I came to realize that this would probably require more work than I initially estimated, due to the limited render target formats GPUs support. We decided that we may also want to take advantage of fixed function IDCT hardware if it is available, and one of the other Nouveau contributors had been looking into this on NV40, so I'm hoping we can take advantage of his efforts and get that into the Gallium NV40 driver in some fashion. Instead I spent the last two weeks of GSoC and the first two weeks of the rest of my life focusing on performance and cleaning up a few bugs here and there.

As far as performance goes, we managed to grab most of the low hanging fruit.

  • We buffer an entire frame of content and fire that off with a few draw calls. Most frames, depending on their content, can be done in two draw calls.

  • Because we have to fill buffers with new content each frame, we don't necessarily want to wait until the GPU is done with those buffers before we map and update them. Since we don't need their old contents we can just allocate a set of buffers and rotate them, double buffer style.

  • For P and B frames many blocks are composed entirely of pixels from the reference frame(s), so we don't technically need to upload any new data.

    Previously we would clear that block of the source texture to black, so that it didn't contribute anything to the destination block. However, for most P and B frames a significant number of blocks fall into this category, and most frames are P or B frames, so that's a lot of useless clearing on the CPU side and texel fetching on the GPU side.

    To get around this we clear the first such zero block of each frame for the luma and two chroma channels, and for subsequent zero blocks we texture from that first block. This saves a nice chunk of CPU time, but doesn't do much for GPU texture bandwidth (see the sketch after this list).

    Once I figure out how TGSI expresses flow control constructs, I'm hoping we can just set the texcoords for zero blocks to the negative range and fetch conditionally, but for older hardware that doesn't support conditional execution the current path should be good.
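
Here's a rough CPU-side sketch of the zero-block handling described above. The structure and helper names (upload_block, clear_block) are made up for illustration; the real state tracker does the equivalent per channel, for luma and both chroma planes.

    /* The first zero block of the frame gets cleared once; every later zero
     * block just points its texcoords at that shared cleared block. */
    struct block_texcoords { float s, t; };

    struct frame_upload {
        int have_zero_block;                 /* cleared block exists yet? */
        struct block_texcoords zero_block;   /* where the cleared block lives */
    };

    /* Hypothetical helpers standing in for the real map/memcpy/memset paths. */
    void upload_block(const short *coeffs, struct block_texcoords at);
    void clear_block(struct block_texcoords at);

    void emit_block(struct frame_upload *f,
                    const short *coeffs,            /* NULL if the block is all zero */
                    struct block_texcoords dst_pos, /* where this block renders */
                    struct block_texcoords *src_tc) /* texcoords to emit for it */
    {
        if (coeffs) {
            upload_block(coeffs, dst_pos);   /* normal path: map and copy the data */
            *src_tc = dst_pos;               /* block samples its own data */
            return;
        }

        if (!f->have_zero_block) {
            clear_block(dst_pos);            /* one clear per frame, per channel */
            f->zero_block = dst_pos;
            f->have_zero_block = 1;
        }
        *src_tc = f->zero_block;             /* all later zero blocks share it */
    }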

Having said all that, however, 720p24 decoding is still not done in real time. It's kind of a mystery actually, because while the profiler seems to indicate that we are GPU limited rather than CPU limited, the numbers don't seem to add up. A 1280x720 video is composed of 80x45 macroblocks. Each macroblock is composed of 4 blocks, and each block is rendered as two triangles, so that's 8 triangles per macroblock, or ~29K triangles per frame. At 24 fps that's ~696K tris/sec or ~2M vertices/sec. Nvidia quotes a GeForce 6200's vertex processing rate at 225M/sec. Our vertex shaders are very simple: we use screen-aligned tris in normalized coords, so we don't have to do any significant transforming, just move inputs to outputs.

Similarly, a 1280x720 video is composed of ~922K pixels. At 24 fps we're rendering ~22M pixels/sec. In the worst case, each pixel requires 5 texel fetches (three 2-byte fetches and two 4-byte fetches) and one 4-byte write to the frame buffer, which brings us to 308M bytes/sec read and 88M bytes/sec write. The color conversion pass adds another 352M bytes/sec read and 88M bytes/sec write. Nvidia quotes a 6200's fill rate as 1.2-1.4B texels/sec, and assuming those texels are 32-bit, that works out to 4.8-5.6B bytes/sec. Again, our pixel shaders are not really complicated, mostly TEX2Ds, MULs, and ADDs. Omitting the tex fetching doesn't change much, and neither does disabling color writes to the frame buffer. Regardless of how Nvidia calculates its marketing numbers, we seem to be well below them, so it probably doesn't matter how optimistic they are.
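
To double-check the figures, here is the arithmetic above as a tiny C program. Nothing GPU-related, it just reproduces the numbers quoted in the last two paragraphs; the 16 bytes of reads per pixel for the color conversion pass is my assumption to match the 352M figure, and ~691K vs. ~696K tris/sec is just a matter of rounding 29K before multiplying.

    #include <stdio.h>

    int main(void)
    {
        const int width = 1280, height = 720, fps = 24;

        /* Geometry: 16x16 macroblocks, 4 blocks each, 2 tris per block. */
        const long mbs       = (width / 16) * (height / 16); /* 80 * 45 = 3600   */
        const long tris      = mbs * 4 * 2;                  /* 28800, i.e. ~29K */
        const long tris_sec  = tris * fps;                   /* ~691K tris/sec   */
        const long verts_sec = tris_sec * 3;                 /* ~2M verts/sec    */

        /* Worst-case MC pass: 5 fetches (3x2 + 2x4 bytes), one 4-byte write. */
        const long pixels_sec = (long)width * height * fps;  /* ~22M pixels/sec  */
        const long mc_read    = pixels_sec * (3 * 2 + 2 * 4);/* ~308M bytes/sec  */
        const long mc_write   = pixels_sec * 4;              /* ~88M bytes/sec   */

        /* Color conversion pass: assumed 16 bytes read, 4 written per pixel. */
        const long csc_read  = pixels_sec * 16;              /* ~352M bytes/sec  */
        const long csc_write = pixels_sec * 4;                /* ~88M bytes/sec   */

        printf("tris/frame %ld, tris/sec %ld, verts/sec %ld\n", tris, tris_sec, verts_sec);
        printf("MC:  read %ld B/s, write %ld B/s\n", mc_read, mc_write);
        printf("CSC: read %ld B/s, write %ld B/s\n", csc_read, csc_write);
        return 0;
    }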

All in all it seems very odd that, given the above, an 854x480 clip renders in real time, but the same clip at 1280x720 takes 4x longer despite only being 2.25x larger. I suspect that either there is a very non-obvious bug in the state tracker, or that we are doing something odd in the driver: possibly in the way we set up the 3D state, submit commands and data, manage memory, or get our frame buffer onto the X window.

Either way, I hope to continue working on this now that GSoC is over, and anyone who is interested in contributing is free to do so. I'm hoping to move things over to Mesa's GIT sooner or later, and I'm curious to see how it does on Intel's hardware. I don't know how well the current rendering process fits with what Intel supports, but if single component signed 16-bit textures aren't a problem it should be very easy to get things up and running. At best all it needs is some minor changes in the Winsys layer.

Saturday, August 2, 2008

IDCT vs. the GPU

I've come to understand a few things while talking to Stephane (marcheu) and trying to come up with a (hopefully fast) way of performing IDCT on a typical GPU: 1) It doesn't fit nearly as well as motion compensation does, and 2) it wouldn't necessarily take a radical departure from current designs to make it fit, just a few adjustments here and there. I think the second point is the more frustrating of the two.

The problems so far are as follows:

  1. The input format, signed 12-bit integers, isn't amongst your GPU's favourite texture formats. We're pretty fortunate that signed 16-bit integers are available at least, even though we have to renormalize to 12 bits. Unfortunately signed 16-bit integers are usually not available with four components, which leads to...

  2. Hefty texel fetch requirements. If we had signed 16-bit RGBA textures we could do some packing and cut the number of fetches by a factor of 4, but we don't. As it stands, a naive 2D IDCT would require 128 texel fetches and 64 MADDs per pixel. Even if we could pack our DCT coefficients into 4 components we would still be left with 32 texel fetches, 16 DP4s, and 15 additions. There are algorithms such as AAN that trade multiplications for additions, but they aren't suited to being calculated a pixel at a time or to our vector arithmetic units. Instead of a 2D IDCT we could opt for two 1D IDCTs, which would give us 16 texel fetches and 8 MADDs per pixel per pass, or 4 texel fetches, 2 DP4s, and 1 addition with full packing (a CPU sketch of both variants follows this list). However, we really can't do 1D IDCTs efficiently because...

  3. We can't render to signed 16-bit buffers of any sort, so we have to find something to do with the intermediate result. The only alternative at the moment would be to render to floating point buffers, but then we lose a whole group of GPUs that can't render to FP buffers, or can but do it very slowly. But even if we manage to do a respectable IDCT, we have one more issue to deal with...

  4. The output of our IDCT has to be easily consumable during our motion compensation pass. Currently motion compensation is a very good fit for the GPU model, however if we have to jump through hoops to fit IDCT into the picture we don't want to make the output too cumbersome to use during motion compensation, which from an acceleration POV is the more important stage to offload.
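
For reference, here is a plain C sketch of the two variants mentioned in point 2: the naive 2D IDCT (64 multiply-adds per output sample, and on the GPU two texel fetches per term without packing, hence the 128 fetches) and the row pass of a separable row-column IDCT (8 multiply-adds per sample per pass, but needing a signed 16-bit or FP intermediate, which is exactly problem 3). This is a CPU reference, not the shader code.

    #include <math.h>

    #define PI 3.14159265358979323846

    static double c(int u) { return u == 0 ? 1.0 / sqrt(2.0) : 1.0; }

    /* Naive 2D 8x8 IDCT: 64 terms per output sample. */
    void idct_2d_naive(const short in[8][8], short out[8][8])
    {
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                double s = 0.0;
                for (int v = 0; v < 8; v++)
                    for (int u = 0; u < 8; u++)
                        s += c(u) * c(v) * in[v][u] *
                             cos((2 * x + 1) * u * PI / 16.0) *
                             cos((2 * y + 1) * v * PI / 16.0);
                out[y][x] = (short)lrint(s / 4.0);
            }
    }

    /* Row pass of the separable version: 8 terms per sample. The column pass
     * is the same loop run over tmp, and tmp is the intermediate result we
     * have nowhere good to store on the GPU. */
    void idct_1d_rows(const short in[8][8], double tmp[8][8])
    {
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                double s = 0.0;
                for (int u = 0; u < 8; u++)
                    s += c(u) * in[y][u] * cos((2 * x + 1) * u * PI / 16.0);
                tmp[y][x] = s / 2.0;
            }
    }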

Having said all of that, what we'd need to be able to do a fast IDCT isn't all that much.

  • I think simply having support for signed 16-bit RGBA textures would go a long way to making IDCT fit. We do have support for signed 16-bit two component textures, so at least we're half way there.

  • Signed 16-bit render targets would also be helpful, although going forward FP16 and FP32 support are probably better to target.

  • The dream of every programmer who has ever used SIMD instructions, the horizontal addition, would also be very useful.

  • Being able to swizzle and mask components as part of a texel fetch would also help, since we receive planar data and have to pack it at some point.

However, we don't have any of that, as far as I know. What we do have is an MC implementation that currently fits very well and still has room for improvement, so at least that's a bright spot. I also have some promising ideas to tackle IDCT and its issues, and 3-4 weeks to figure it out.

Monday, July 21, 2008

Up and running on real hardware

I reached a nice milestone today: working playback on my GeForce 6200. Most of the work went into the winsys layer, with some bug fixes and workarounds in other places, but everything is up and running now. Unfortunately the output isn't perfect; there is some slight corruption here and there. I'm guessing it has to do with some dodgy assumptions I made about shader arithmetic (rounding, saturation, etc.) that SoftPipe went along with but the GPU didn't. The other issue is that there is some severe slowdown when any 2D drawing happens on the rest of the desktop. I'm guessing this may be due to locking when copying the backbuffer to the window, or maybe I'm completely soaking up the CPU.

Currently nothing is optimized (I'm not even turning on compiler optimization), and I have a really slow prototype IDCT implementation running on the CPU in place of the hardware version, so I'm sure I'm eating up a lot more CPU time than I will be by the end of the summer. I have a lot of ideas for optimizations targeting CPU usage and GPU fill rate, and given that I already get almost full-speed playback, I'm pretty confident that I'll be able to get HD playback by the end of SoC.

As far as the winsys goes, I was able to reuse most of the current Nouveau winsys. Unfortunately the DRI stuff is buried within Mesa right now, so I had to extract a lot of things and create a standalone library to handle screens, drawables, the SAREA, and so on, in order to use DRI without including and linking against half of Mesa. The winsys interface is also simpler than Mesa's; there are only a few client calls, the backbuffer is handled in the state tracker, and the winsys doesn't have to create or call into the state tracker. It took me a while to realize why the Mesa winsys was set up the way it was, and that I could simplify things on my end.

Here are some screen grabs:

Coffee mug containing two pens and a feather. Woman on the phone. Windmill in the middle of a field of yellow flowers.
Two cartoon characters (desktop visible). Fighter jet flying through a clear sky (desktop visible).

Sunday, June 29, 2008

One hurdle down, many more to go

Just a quick update: after some re-reading of the MPEG2 spec, debugging, and cleanup, I've finally got correct output from the MC stage for progressive video clips that use frame-based and field-based motion prediction. There are two other motion prediction methods, 16x8 and dual-prime, but they don't seem to be too common and shouldn't be too hard to implement anyway. It took a bit of tweaking, but comparing the output to that of other media players I see no difference, which means one hurdle down. Next steps are to revisit IDCT and start working with real hardware.

Here are some screen grabs from various test clips:

Construction site on a field.
Windmill in the middle of a field of yellow flowers.
Coffee mug containing two pens and a feather.
Woman on the phone.

Thursday, June 26, 2008

Progress

I put some work into getting field-based prediction working, and I think I have it mostly right. I ran into what I think is a bug in SoftPipe, which has to do with locking and updating textures: for some reason the surface and texture cache doesn't get invalidated in that case, leading to stale texels being read and displayed. I manually flush the texture cache after mapping textures, and that seems to take care of it. It took a lot of debugging to track that one down, and it's probably fixed upstream, but at least it's another issue out of the way. At the moment some macroblocks are still not rendered correctly, but I'm hoping to sort those out soon.

The one thing I really can't stand is writing shader code for Gallium. The amount of C code you need to write to generate a token stream for even a simple shader is obscene. Currently I have 12 shaders, and each is about 200-300 lines of code for 10-15 shader instructions, so most of that code is noise. On more than one occasion I've made changes to the wrong shader just because it's so hard to wade through the code. What I wouldn't do for a simple TGSI assembler right about now. I'll have to do something about that; it's a huge eyesore.

It's not surprising that I'm a little behind on my schedule. I started on IDCT a while back but put that code down to focus on MC. Luckily IDCT isn't strictly necessary, since XvMC allows for MC-only acceleration, so I can test things and move forward on MC without having to worry about IDCT. I'm hoping the next step of the project, getting things running on real hardware, will be as painless as possible, allowing me to get IDCT working. However, considering all the little unforeseen issues that have cropped up with SoftPipe, I wouldn't be surprised if I ran into more of the same with the Nouveau driver.

Monday, June 9, 2008

Moving along

Things are moving along in the right direction. I finally got a chance to push my work to date to Nouveau's mesa git, you can check it out here. I have I, P, and B macroblocks working correctly when rendering frame pictures and using frame-based motion compensation. All that's left to implement is field-based motion compensation (which is surprisingly very common, even in progressive content) and rendering field-based pictures (i.e. interlaced content).

I think I've figured out a way to efficiently render macroblocks that use field-based prediction in one pass. Frame-based prediction works by grabbing a macroblock from a previously rendered surface and adding a difference to form the new macroblock. Field-based prediction works the same way, but references two macroblocks on the previously rendered surface, one for even scanlines and the other for odd. My plan is to read from both reference macroblocks every scanline and choose which one to keep based on whether the scanline is even or odd. This can easily be done with a lerp(). It would be preferable to avoid the unnecessary texture read, but it's simple and works in a single pass. Other alternatives include rendering the macroblock twice (once with even scanlines only, then with odd scanlines, using texkill to discard alternating scanlines), and rendering even and odd scanlines using line lists (which I understand makes sub-optimal use of various caches in the pixel pipeline).
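
To make the idea concrete, here is a CPU-style sketch of that single-pass selection. The addressing is deliberately simplified (real field prediction addresses the reference in field coordinates and has to deal with half-pel motion), and the names are hypothetical; on the GPU the final selection would be the equivalent lerp()/LRP with a 0/1 factor derived from the scanline parity.

    /* Fetch both field references unconditionally, then keep one based on
     * whether the destination scanline is even or odd. */
    unsigned char predict_field_based(int x, int y,
                                      const unsigned char *ref, int ref_pitch,
                                      int top_mv_x, int top_mv_y,
                                      int bot_mv_x, int bot_mv_y)
    {
        unsigned char from_top = ref[(y + top_mv_y) * ref_pitch + (x + top_mv_x)];
        unsigned char from_bot = ref[(y + bot_mv_y) * ref_pitch + (x + bot_mv_x)];

        /* factor is 0 on even lines, 1 on odd: result = lerp(top, bottom, y & 1) */
        int odd = y & 1;
        return (unsigned char)(from_top * (1 - odd) + from_bot * odd);
    }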

Monday, May 26, 2008

Something to show

I've put some more work into getting P and B frames rendering correctly and things are proceeding very well. Currently texturing from the reference frame works for P frames. All I have to do is add the differentials, which is a little tricky. The problem is that differentials are 9 bits, which means that in an A8L8 texture we get 8 bits in the L channel and 1 bit in the A channel. This shouldn't be too hard, just a bit of arithmetic in the pixel shader code. A more interesting problem is dealing with field-based surfaces, both when rendering and when using them in motion prediction. There's no straightforward way to render to even/odd scanlines on conventional hardware, so this will require some special attention. Currently I'm thinking I will have to render line lists instead of triangle lists when a macroblock uses field-based motion prediction and for rendering even/odd scanlines.
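
Here's a small CPU-side sketch of one way the 9-bit differential could be split across an A8L8 texel and put back together, assuming a simple bias-by-256 encoding with the low 8 bits in L and the ninth bit in A. The actual shader would do the same arithmetic on normalized channel values.

    #include <assert.h>

    struct a8l8 { unsigned char a, l; };

    /* d is a signed 9-bit differential in [-256, 255]. */
    struct a8l8 pack_diff(int d)
    {
        assert(d >= -256 && d <= 255);
        unsigned int biased = (unsigned int)(d + 256);      /* 0..511, 9 bits   */
        struct a8l8 t = { (unsigned char)(biased >> 8),     /* 9th bit in A     */
                          (unsigned char)(biased & 0xff) }; /* low 8 bits in L  */
        return t;
    }

    int unpack_diff(struct a8l8 t)
    {
        return (int)(((unsigned int)t.a << 8) | t.l) - 256;
    }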

Here are some images from mpeg2play_accel, which I've been using as a test program:

Initial I-frame of the video.
Next frame, only P macroblocks using frame-based motion prediction are currently displayed, the rest are skipped.
Next frame, more macroblocks are rendered, and it looks mostly correct, except for the fine details. This is because the differentials are not taken into account yet.
Next frame, a few more unhandled macroblocks in this one.
Sunday, May 18, 2008

Intra-coded macroblocks? Check

After a few weeks of work I've made some good progress. Basic rendering of intra-coded macroblocks is working. What this means is that if you view a video you'll see the occasional full frame displayed correctly, and some macroblocks from the frames in between displayed correctly. Intra-coded macroblocks are the simplest to deal with, since they don't depend on motion compensation; all the data is present and you just have to render it. Every nth frame of an MPEG2 stream is composed entirely of intra-coded macroblocks. It's these frames that are currently being displayed correctly. Other frames are composed of some intra-coded macroblocks, but mostly inter-coded macroblocks. Inter-coded macroblocks depend on motion compensation and their samples are usually differentials. These I haven't gotten yet.

I've also cleaned things up a bit, added some error checking, and added some more tests. It's taken a lot of stepping through Gallium code to get things right, in lieu of documentation, but thanks to GDB, and even more to Insight, I've gotten this far. Stephane has answered my questions, mostly on how to do things efficiently, and even the folks in #mplayerdev have been helpful on XvMC and general decoding matters, so all in all I would say things are going smoothly.

One thing I'm sure of is that no one reads this thing currently. The X.Org folks have asked that their students keep a blog and also submit it to planet.freedesktop.org but I was told it only accepts RSS feeds. Currently this entire web site is maintained using a text editor, so I'll have to work out something more sophisticated in the near future. :-/ (Update: Since then I've been using BlogSpot and you're probably reading this post there instead of the old page.)

Thursday, May 1, 2008

Up and running

Today I managed to get the basic color conversion step up and running using SoftPipe. Most of the difficulty came in understanding Gallium rather than in implementing the color conversion itself. I spent many hours trying to figure out why I couldn't get any geometry to show up in my window. Copying surfaces to the frame buffer worked fine, but rendering a triangle left me staring at a black screen. It turns out that you have to set the pipe_blend_state.colormask bits for the channels you want to write to. First, I didn't even consider that state because I had disabled blending. Second, setting the mask to allow writes was the opposite of what I would have assumed. It took several hours of stepping through Gallium to find that everything was OK until the fragment shader stage, which skipped the frame buffer write-back.
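
For anyone hitting the same wall, this is roughly what the fix amounts to: a blend state with blending left off but the colormask opened up. The field and define names are from the Gallium headers as I remember them, so treat this as a sketch rather than gospel.

    #include <string.h>
    #include "pipe/p_state.h"    /* struct pipe_blend_state */
    #include "pipe/p_defines.h"  /* PIPE_MASK_* */

    /* Blending disabled, but color writes explicitly enabled. Leaving the
     * colormask at zero means nothing ever reaches the frame buffer. */
    void make_noblend_state(struct pipe_blend_state *blend)
    {
        memset(blend, 0, sizeof(*blend));
        blend->colormask = PIPE_MASK_R | PIPE_MASK_G | PIPE_MASK_B | PIPE_MASK_A;
    }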

Other issues included getting a handle on writing TGSI shader code and figuring out how to get the Gallium and XvMC APIs to agree. At the moment generating TGSI isn't a pretty process; you can look in gallium/auxiliary/util/u_simple_shaders.c for an example. As for Gallium and XvMC agreeing, most of the problem came from the fact that the XvMC functions all accept a Display*. What do you do if the client creates an XvMC context with one Display*, creates a surface with another Display*, and so on? Well, hopefully no one will do that, but one has to wonder why it's even allowed. Then there's the issue of some calls only taking an XvMCSurface*, and not the associated context. Unfortunately the context is where I keep the Gallium pipe context, so every surface has to keep a reference to the context it was created with. Luckily this works out, since some functions that do take both a surface and a context require us to check that they match, so at least that check becomes simple.

Friday, April 25, 2008

Accepted, digging through code

After a long interim period I now know that the proposal has been accepted. Rather than sit around, I've been working on getting things up and running, so I'm glad I got a head start. It took a lot of digging through Mesa code and some questions to the dri-devel mailing list and IRC channel, but I've managed to get some basic initialization out of the way. I've also implemented enough functionality and stubs to get some basic test cases compiling and running successfully. I found a port of mpeg2play on Mark Vojkovich's web site that uses XvMC and have managed to get that compiling and running. By running I mean not crashing; it doesn't display anything as of yet, but at least I'm heading in the right direction.

Now I'll need to figure out how to get XvMC surfaces onto X drawables with Gallium3D. For some reason most of the XvMC functions don't take the XvMCContext as an argument, so I have to store that along with each surface, and yet they all take a pointer to Display, which I don't see a use for. A headache more than anything else, but it seems counter-intuitive to me. Also, the Gallium3D API is new to me and will take some time to figure out. Thankfully, Keith Whitwell provided me with an in-depth explanation of how to start on the state tracker and winsys. I'm hoping that by the end of this weekend I'll have something on screen, even if it's garbage (i.e. the video frames before IDCT). I'm also hoping to get started on writing shader code to do the color conversion. Stephane Marchesin, my mentor for this project, was kind enough to point me to the current Xv implementation for the Nouveau driver, which currently does color conversion and bicubic interpolation in shaders.

Monday, April 7, 2008

Submission day redux

So after last week's deadline extension, today became the deadline for the proposal submission. It hasn't really affected me, but it would have been nice to know by now whether this project has been accepted. Instead, the accepted proposals will be announced April 21st.

In the interim I've been looking through the libXvMC and Mesa sources. The source to libXvMC is a little confusing, partly because of the wrapper library that comes with it, but after a little reading, grep-ing, and peeking at the openChrome XvMC driver I think I've got a handle on how things work. As far as I can see, the libXvMC module provides implementations for all the hardware-agnostic XvMC API calls and leaves the rest to the driver. It also exports some functions that the driver can use to interface with X. The wrapper module, libXvMCW, is intended for clients to link against and exports all the XvMC functions the client expects. The wrapper doesn't contain any implementation, but instead attempts to dynamically load the libXvMC module for the hardware-agnostic functions, and a hardware-specific driver (e.g. libXvMCNvidia) for the rest. The driver is left to implement the surface/block/rendering related functions. So with that, I think it's pretty clear which functions I would have to provide in terms of Gallium3D to complete the implementation.
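
The dispatch scheme boils down to dynamic loading. As a rough illustration (library and symbol names are placeholders, not the wrapper's actual code), the wrapper does something along these lines:

    #include <dlfcn.h>
    #include <stdio.h>

    typedef void (*xvmc_entry_fn)(void);

    /* Load a module (libXvMC proper, or a vendor driver such as libXvMCNvidia)
     * and resolve one of its entry points. A real wrapper caches the handle
     * and resolves everything up front rather than once per call. */
    xvmc_entry_fn resolve_entry(const char *lib, const char *symbol)
    {
        void *handle = dlopen(lib, RTLD_LAZY | RTLD_GLOBAL);
        if (!handle) {
            fprintf(stderr, "failed to load %s: %s\n", lib, dlerror());
            return NULL;
        }
        return (xvmc_entry_fn)dlsym(handle, symbol);
    }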

In addition to that I've gotten back into using Matlab for some prototyping. Matlab is a great tool for this sort of thing because it allows you to easily visualize your data, and I've been using it to test some CSC and IDCT routines.

Monday, March 31, 2008

Submission day

Today is the deadline for the proposal submission. I hope it's accepted; I think it's a perfect fit for GSoC. I had some trouble trimming it down to 7500 characters because I initially misread the guidelines and thought they said 7500 words; if it weren't for someone on the dri-devel mailing list reminding me, I'd have submitted all 8K characters. I'm feeling lucky already!

Saturday, March 29, 2008

Potential benefits of accelerated video decoding?

I've started considering some of the potential benefits of hardware accelerated video decoding from a user's perspective. The biggest one for me is being able to play back HD streams in real-time. I have a modest machine and it does struggle with HD streams, but having read this paper I'm encouraged by one statement in particular. Testing on a machine equipped with a Pentium III @ 667 MHz, 256 MB of memory, and a GeForce 3 Ti200, they state that they were able to play back a 720p ~24-frame/s stream encoded with WMV at 5 Mb/s. That hardware is pretty ancient by today's standards, and yet with the GPU handling MC and CSC they get a 3.16x speed-up over the CPU w/ MMX implementation. That's pretty encouraging. I'm sure all the folks out there who use their machines as HTPCs would really appreciate that sort of performance.

Another benefit would be that implementing this in terms of a Gallium3D front-end allows it to be used on all hardware that has a Gallium3D back-end. Currently I believe certain Intel GPUs are supported very well, as well as Nvidia through Nouveau, but I'm thinking specifically of AMD/ATI. As far as I know their Linux drivers have never supported any sort of video decoding acceleration, even though the hardware is very capable and the functionality is implemented on Windows. Recently they released hardware specs for some of their GPUs, but this did not include any of the dedicated video decoding hardware if I recall correctly. However, with specs for the newer GPUs and various reverse-engineered drivers for older GPUs already existing, comprehensive Gallium3D support for ATI GPUs will probably happen. I think the point is obvious by now: hardware accelerated video decoding on ATI GPUs finally.

Monday, March 24, 2008

Jumping right in

To get myself familiar with how a Gallium3D front-end works I downloaded the source to Mesa and built the library. I had a little trouble trying to figure out which make target built Mesa using SoftPipe, but someone on #dri-devel was kind enough to tell me. I also downloaded openChrome's libXvMC source, but it's not immediately clear to me how the library works. It appears to do some work in terms of Xlib, xext, and others, but expects a few functions (the ones that actually touch the hardware) to be provided and linked in. Odds are this is where the Gallium3D calls will end up.