The problems so far are as follows:
- The input format, signed 12-bit integers, isn't amongst your GPU's favourite texture formats. We're pretty fortunate that signed 16-bit integers are available at least, even though we have to renormalize to 12 bits. Unfortunately signed 16-bit integers are usually not available with four components, which leads to...
- Hefty texel fetch requirements. If we had signed 16-bit RGBA textures we could do some packing and cut down on the number of fetches by 4, but we don't. Therefore, a naive 2D IDCT would require 128 texel fetches and 64 MADDs per pixel. Even if we had the ability to pack our DCT coefficients into 4 components we would still be left with 32 texel fetches, 16 DP4s, and 15 additions. There are algorithms such as AAN, which trade multiplications for additions, but are not suited to being calculated a pixel at a time or taking advantage of our vector arithmetic units. Instead of a 2D IDCT we could opt for two 1D IDCTs, which would give us 16 texel fetches and 8 MADDs per pixel per pass, or 4 texel fetches, 2 DP4s, and 1 addition if we could do full packing. However, we really can't do 1D IDCTs efficiently because...
- We can't render to signed 16-bit buffers of any sort, so we have to find something to do with the intermediate result. The only alternative at the moment would be to render to floating point buffers, but then we get lose a whole group of GPUs that can't render to FP buffers or can but do it very slowly. But even if we manage to do a respectable IDCT, we have one more issue to deal with...
- The output of our IDCT has to be easily consumable during our motion compensation pass. Currently motion compensation is a very good fit for the GPU model, however if we have to jump through hoops to fit IDCT into the picture we don't want to make the output too cumbersome to use during motion compensation, which from an acceleration POV is the more important stage to offload.
Having said all of that, what we need to be able to do a fast IDCT isn't so much.
- I think simply having support for signed 16-bit RGBA textures would go a long way to making IDCT fit. We do have support for signed 16-bit two component textures, so at least we're half way there.
- Signed 16-bit render targets would also be helpful, although going forward FP16 and FP32 support are probably better to target.
- The dream of every programmer who has ever used SIMD instructions, the horizontal addition, would also be very useful.
- Being able to swizzle and mask components as part of a texel fetch would also help, since we receive planar data and have to pack it at some point.
However, we don't have any of that, as far as I know. What we do have is an MC implementation that currently fits very well and still has room for improvement, so at least that's a bright spot. I also have some promising ideas to tackle IDCT and its issues, and 3-4 weeks to figure it out.