Is there any reason to write your own JPEG and PNG decompression, except for educational purposes (D3DX accepts both these formats, and if you want a portable solution you can use open-source libraries)?


As I say, I started it 10 years ago... back then I didn't even use D3D yet. The open source libraries were really slow back then (probably still are), with really terse code, and clumsy to use. I used part of my JPG decoder at one point to accelerate a histogram application for a Canon PowerShot camera... It was originally based on the IJG implementation.
I just never completed my JPG decoder, so I figured I'd take care of the unfinished business. At least now I can REALLY say that I fully understand the JPG format. I have verified it myself.

PNG can probably be done in just a few hours... so why not? At least I'd have 'the complete set' then.
Posted on 2009-12-07 11:34:44 by Scali
ti_mo_n: DXT-class formats generally render that useless, but a jpg_decode+dxt_compress on the GPU could be a nice feature for projects that load from slow DVD/BD (8-10 MB/s).

Yes, true. Using JPEG is still better in some/many situations, but I wanted to know about the idea of writing one's own JPEG decompression instead of using the already available solutions. I just can't think of any reason for that (except education).

As I say, I started it 10 years ago... back then I didn't even use D3D yet. The open source libraries were really slow back then (probably still are), with really terse code, and clumsy to use. I used part of my JPG decoder at one point to accelerate a histogram application for a Canon PowerShot camera... It was originally based on the IJG implementation.
I just never completed my JPG decoder, so I figured I'd take care of the unfinished business. At least now I can REALLY say that I fully understand the JPG format. I have verified it myself.

PNG can probably be done in just a few hours... so why not? At least I'd have 'the complete set' then.

Sure, I understand the educational idea and the "I just want to finish my job" idea. I just wanted to know if there is -generally- any good reason for writing one's own JPEG/PNG decompression. I thought about writing my own (actually I already have, but it was slow), but seeing that DX supports it, GDI+ supports it, and open-source libs support it, I can't find any good reason. So maybe you guys have tested this or that and found real flaws - things your own code could do better? You know - some 'objective' reasons for writing one's own decompressor, like speed, space, bugs, etc.

That's actually what I meant in my question. Sorry for any confusion ^^'
Posted on 2009-12-07 13:27:01 by ti_mo_n
Sure, I understand the educational idea and the "I just want to finish my job" idea. I just wanted to know if there is -generally- any good reason for writing one's own JPEG/PNG decompression. I thought about writing my own (actually I already have, but it was slow), but seeing that DX supports it, GDI+ supports it, and open-source libs support it, I can't find any good reason. So maybe you guys have tested this or that and found real flaws - things your own code could do better? You know - some 'objective' reasons for writing one's own decompressor, like speed, space, bugs, etc.

That's actually what I meant in my question. Sorry for any confusion ^^'


Depends on what you want to do with it. In the case of the Canon PowerShot... it was an 80186 processor with 1 MB of memory.
Clearly this called for a custom-made JPG solution, as the regular IJG library was just not efficient enough for such a simple 16-bit processor. So by understanding both the JPG algorithm and the shortcomings of the processor, we could build a special JPG decoder for the job (since we were only interested in a histogram, we didn't even have to do a full decode: just decoding the Y channel and accumulating the results directly was enough).
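To give an idea (just a simplified sketch, not the actual Canon code): once you skip the Cb/Cr blocks entirely, accumulating the histogram is nothing more than counting the decoded Y samples per block.

// Simplified sketch (not the actual Canon code): accumulate a luminance
// histogram directly from each decoded 8x8 Y block, skipping the chroma
// blocks, the upsampling and the colour conversion entirely.
static void accumulate_y_histogram(const unsigned char *yBlock,   /* 64 decoded Y samples, 0..255 */
                                   unsigned long histogram[256])
{
    int i;
    for (i = 0; i < 64; i++)
        histogram[yBlock[i]]++;
}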

Another reason could be quality. Many JPG decoders have very limited precision (same goes for mp3 by the way). By building your own decoder, you can make sure that you get maximum decode quality.

I originally wrote my GIF decoder because I wanted to write my own intros... I could fit my decoder in about 2k of code... a standard GIF library would be much larger, and slower to boot (which actually mattered back in those days, when I used 486 and Pentium machines). Since GIF was limited to 256 colour images, I wanted JPG support as well, for truecolour.

I probably won't be using it for any of my D3D code, in case you're wondering. I've always used D3DX, and haven't seen any reason to use anything else.
Posted on 2009-12-07 14:09:07 by Scali
Well, I've now worked out the 'proper' way to upsample the Cb and Cr components stored in a JPG file... and I'd say I currently have a 'reference' implementation of a JPG decoder.
I do full upsampling with a bilinear filter, and I use double precision for the iDCT and the YCbCr->RGB conversion. So the results are pretty much the highest possible quality; no corners are cut anywhere.
I'll do a bit more refinement of the memory allocation here and there, as some buffers are just the maximum size rather than the smallest possible size. Then I may play around a bit with a less bruteforce iDCT and perhaps an SSE-optimized YCbCr->RGB conversion or such.
Considering the speed of execution I'm already getting with the current 'bruteforce' solution, it's purely a toy project at this point... Making it faster won't serve much practical purpose, I suppose. So at any rate I won't be using optimizations that compromise quality.
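For reference, a 'bruteforce' double-precision iDCT is little more than the textbook formula written out directly. Something along these lines (a simplified sketch, not the exact code in my decoder):

#include <math.h>

/* Simplified sketch of a reference 8x8 inverse DCT in double precision,
 * straight from the textbook definition: no fast factorization, no integer
 * approximation. 'in' holds the dequantized coefficients in natural
 * (row-major) order, 'out' receives the spatial samples before level shift. */
static void idct8x8_reference(const double in[64], double out[64])
{
    const double PI = 3.14159265358979323846;
    int x, y, u, v;

    for (y = 0; y < 8; y++) {
        for (x = 0; x < 8; x++) {
            double sum = 0.0;
            for (v = 0; v < 8; v++) {
                for (u = 0; u < 8; u++) {
                    double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
                    sum += cu * cv * in[v * 8 + u]
                         * cos((2 * x + 1) * u * PI / 16.0)
                         * cos((2 * y + 1) * v * PI / 16.0);
                }
            }
            out[y * 8 + x] = 0.25 * sum;
        }
    }
}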

I wonder what the status of decompressors is these days. The IJG implementation contains a fast integer-based iDCT, which has only limited precision (meant for 16-bit processors). In the early days of JPG, such fast solutions were often used, because a full solution would be too slow. Decoders would also skimp on the upsampling, as it's a reasonably costly operation when you don't have the great caching and fast multiplies that we have today.
Likewise, in the early days of mp3/mpg decoding, they'd often use fast iDCT approximations and other trickery to speed things up. Intel actually provided a sample iDCT implementation for MMX, which was notorious for its limited precision (MMX only allows 16-bit arithmetic, unless you pull some trickery), but it was just copy-pasted into many decoders at the time.

I recall that WinAMP would take about 20-25% CPU when I played a 128 kbit mp3 on my Pentium 133. Another player, Sonique, would take up to 60-70% on the same machine with the same mp3. Its sound quality was significantly better, though. It just wasn't practical for playing music in the background.

So I wonder... obviously those old applications will still have limited precision when decoding... but did they ever improve the precision in newer versions, now that CPUs no longer have a problem with them? What do I get exactly, when I want to load a JPG file or an mp3 through a standard Windows decoder today?

I know video has improved a lot since GPUs started accelerating it. The earliest improvement I can recall is when I got my Matrox Mystique. It had an overlay which could do hardware-accelerated upscaling of YUY2 and convert it on the fly.
Not only did this make playback of MPG files much less CPU-intensive and allow higher resolutions, it also applied a bilinear filter, so you'd get a much smoother image (before MMX, nobody did bilinear filtering yet; it was just too expensive).
Every time I bought a new videocard, the playback quality of my videos improved. I suppose that makes sense, as it's one of the selling points of videocards. There's no point in being 'the fastest', as the framerate for a video is fixed anyway. The only thing that matters is to extract as much quality as possible from the video stream.

So... I wonder if the people writing/using these lossy software decoders have paid any attention to quality, and bumped it up to the maximum, now that CPUs can handle it with ease... or do we still get the crippled decoders that were once optimized for our 486/Pentium class CPUs?

On another note... I just saw this the other day:
http://www.brightsideofnews.com/news/2009/12/8/nvidia-gf100fermi-sli-powered-maingear-pc-pictured.aspx

If these specs are correct, they sound quite good to me. The GTX380 will have about the same theoretical processing power as the GTX295. The GTX295 can outperform the HD5870 in pretty much all cases. With the GTX380 being a single-chip solution rather than SLI, and having a more modern architecture and higher-bandwidth memory, I wouldn't be surprised if it was significantly faster than the GTX295 and the HD5870.
If it does turn out that fast, and the price and power consumption aren't excessive, I'll probably get the GTX360 for myself.
Posted on 2009-12-10 03:51:45 by Scali
So I wonder... obviously those old applications will still have limited precision when decoding... but did they ever improve the precision in newer versions, now that CPUs no longer have a problem with them? What do I get exactly, when I want to load a JPG file or an mp3 through a standard Windows decoder today?

The only way to find out, I guess, is to write a high-quality decompressor and compare it to a tool which uses, for example, GDI+ (which is VERY fast at decompression, so it most probably uses some trickery). I can quickly make a tool like that so we can compare the produced images.

As for the upscaling: isn't a bicubic or Lanczos filter better than a bilinear one? I'm really interested in seeing the highest quality decompressor possible and comparing it to some known decompressors (D3DX, GDI+, etc).

The same goes for mp3: comparing a super-high-quality mp3 decompressor to WinAMP on my X-Fi card would be interesting, at the very least ^^
Posted on 2009-12-10 11:56:09 by ti_mo_n

As for the upscaling: isn't a bicubic or Lanczos filter better than a bilinear one? I'm really interested in seeing the highest quality decompressor possible and comparing it to some known decompressors (D3DX, GDI+, etc).


Well sure, theoretically you can always think of a filter that is better... at the very least you can always add an extra order.
But I've never heard of anyone applying anything more than a bilinear filter for regular images/video.
Bicubic or b-spline is interesting for extreme zoom-in... but in this case you just need to scale up the Cb/Cr components. The filter is more about subpixel correction than about the scaling, really. You get one Cb/Cr sample for every 4 Y samples. This one sample is exactly in the center of those 4. So you only need to 'move' the sample a relatively small amount, namely half a pixel (and the other 'nearest' samples are 2 pixels away). This means that a higher order filter probably won't have that much of an effect.
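To illustrate the geometry (a sketch under those assumptions, not the literal code in my decoder): for every full-resolution pixel you only blend the two nearest chroma samples per axis, with quarter/three-quarter weights coming from that half-pixel offset.

/* Sketch of bilinear chroma upsampling for the 2x2-subsampled case described
 * above (not the literal code in my decoder). Each chroma sample sits in the
 * centre of a 2x2 group of Y samples, hence the 0.5 offset below. */
static unsigned char sample_chroma_bilinear(const unsigned char *plane, /* subsampled Cb or Cr plane */
                                            int cw, int ch,             /* its dimensions */
                                            int x, int y)               /* full-resolution coordinates */
{
    double fx = (x - 0.5) / 2.0;    /* position of (x, y) in chroma space */
    double fy = (y - 0.5) / 2.0;
    int x0, y0, x1, y1;
    double wx, wy, top, bottom;

    /* Clamp to the edges of the chroma plane. */
    if (fx < 0.0) fx = 0.0;  if (fx > cw - 1) fx = cw - 1;
    if (fy < 0.0) fy = 0.0;  if (fy > ch - 1) fy = ch - 1;

    x0 = (int)fx;  x1 = (x0 + 1 < cw) ? x0 + 1 : x0;
    y0 = (int)fy;  y1 = (y0 + 1 < ch) ? y0 + 1 : y0;
    wx = fx - x0;
    wy = fy - y0;

    top    = plane[y0 * cw + x0] * (1.0 - wx) + plane[y0 * cw + x1] * wx;
    bottom = plane[y1 * cw + x0] * (1.0 - wx) + plane[y1 * cw + x1] * wx;
    return (unsigned char)(top * (1.0 - wy) + bottom * wy + 0.5);
}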
Posted on 2009-12-10 12:20:47 by Scali
I think I've covered most of the rough edges in my code now...
The only problem I currently have is that my upscaler is hardwired for the 4:1:1 case. Theoretically, JPG supports a wide variety of sampling factors. However, I don't think you'll find anything other than 4:1:1 (colour) images in the wild...
Likewise I only support the baseline lossy compression method. In theory JPG also supports some progressive formats and a lossless mode.

So there may be images that produce wrong results or even crash the decoder, but I think the chance of running into one is VERY small.

Perhaps I can put the library and my simple test program online... If you want, you can then do some quality comparisons against other decoders.
Could be a good test for my library as well... I *think* it decodes properly, but I could be wrong :)
And even if mine decodes properly, it could still be that other decoders have better quality, so it might be interesting to study them, and see if I can improve mine further.
Perhaps a comparison against libjpeg (the IJG implementation) is also interesting. I'm not sure which one Microsoft uses... Perhaps they use the IJG one as well, or they rolled their own.

I wonder if it is possible to reduce JPG artifacts on images with low-quality encoding... e.g. what if you use something like Lanczos to scale the image up, and then box-filter the result down to the original size? Would the Lanczos smooth out the JPG artifacts?
Posted on 2009-12-11 05:01:29 by Scali
After playing around with a few simple test-images and custom compression quality settings, I found that there are still some corner cases that I don't handle correctly.
One case is when I generated a green 32x32 image with '0%' quality in Paint.NET.
For some reason the entire quantization table comes up as -256, while it looks like 255 or 256 would be the desired result. I haven't quite figured out yet why the quantization table is not loaded correctly in that case. As far as I knew, it was just stored as an array of short ints, and with most images the quantization tables just work fine.
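For what it's worth, the way the spec stores these tables is simple enough. A sketch of a straightforward DQT parser (not my actual code) looks something like this; the entries are unsigned, 8-bit by default and 16-bit big-endian only when the precision nibble is set, so a signed or 16-bit misread is the kind of thing that could plausibly produce a value like -256.

/* Sketch of a straightforward DQT (0xFFDB) parser, not my actual code.
 * Per the spec, each table starts with a Pq/Tq byte: Pq = 0 means 64
 * unsigned 8-bit entries, Pq = 1 means 64 unsigned 16-bit big-endian
 * entries; the values are stored in zigzag order. */
static const unsigned char *parse_dqt(const unsigned char *p,   /* just past the 2-byte segment length */
                                      const unsigned char *end, /* end of the DQT segment */
                                      unsigned short tables[4][64])
{
    while (p < end) {
        int pq = p[0] >> 4;     /* precision: 0 = 8-bit, 1 = 16-bit */
        int tq = p[0] & 0x0F;   /* table id, 0..3 */
        int i;
        p++;
        for (i = 0; i < 64; i++) {
            if (pq) {
                tables[tq][i] = (unsigned short)((p[0] << 8) | p[1]);
                p += 2;
            } else {
                tables[tq][i] = p[0];
                p++;
            }
        }
    }
    return p;
}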

Another thing is that the last 2 8x8 blocks in an all-red or all-blue 32x32 image have some kind of gradient, even though I see no sign of that in the encoded data (all blocks should be the same, and I decode them all the same as well). Could be that the DC component somehow doesn't work correctly. Again, with most images, I don't see a sign of this problem...

They are just minor issues though (that is, not too much impact, although the problems don't seem easy to find), so I could release a library/test app as-is, so ti_mo_n could try to do some comparisons.
Posted on 2009-12-17 05:00:01 by Scali
I wonder if it is possible to reduce JPG artifacts on images with low-quality encoding... e.g. what if you use something like Lanczos to scale the image up, and then box-filter the result down to the original size? Would the Lanczos smooth out the JPG artifacts?

As this page shows, a B-spline filter is much better at blurring blockiness and ringing. Lanczos seems to preserve edges, which is not what you'd want in a blocky and ringy image.

Another thing is that the last 2 8x8 blocks in an all-red or all-blue 32x32 image have some kind of gradient, even though I see no sign of that in the encoded data (all blocks should be the same, and I decode them all the same as well). Could be that the DC component somehow doesn't work correctly. Again, with most images, I don't see a sign of this problem...

DC coeff stores the average value of ALL pixels in a block. Gradients are formed by AC coefficients. If you're getting gradients in places where they aren't supposed to be, then it's most probably either a bug in the IDCT or in reading the AC coefficients. Might be an "it's not reading the last byte" bug, which is a quite common mistake.
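(Quick check against the textbook 8x8 DCT definition, with C(0) = 1/sqrt(2): the DC coefficient really is just the scaled block average, and a DC-only block decodes to a flat colour.)

F(0,0) = \tfrac{1}{4} C(0)^2 \sum_{x=0}^{7}\sum_{y=0}^{7} f(x,y) = \tfrac{1}{8}\sum_{x,y} f(x,y) = 8\bar{f},
\qquad
f_{\text{DC only}}(x,y) = \tfrac{1}{4} C(0)^2 F(0,0) = \tfrac{F(0,0)}{8} = \bar{f}.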

Anyway, give me a link, so I can compare quality against GDI+ and Intel's JPEG lib (which is discontinued, unfortunately - now it's part of IPP).
Posted on 2009-12-18 08:48:48 by ti_mo_n
Well, that's interesting... I wonder if it is a bug in my JPG decoder at all.
The problem of the last 2 blocks having a strange gradient occurs with both red.jpg and blue.jpg.
I've made my JPG Loader spit out all 8x8 blocks, before and after DCT. I can see that all blocks are the same, except for the last two Y blocks.
Since these are not the last blocks in the stream, and the subsequent Cb and Cr blocks are correct again, I don't think there is a problem with missing the last byte. The error doesn't occur at the end.
All blocks contain only the first value, with the rest all 0, except for those last two Y blocks. I wonder why those don't come out as zeros. Why did the decoder not decode the run of zeros like in the other blocks? They should be identical.
And why don't other files seem to suffer from problems in the last two blocks, or any blocks at all?

Anyway, I'll attach what I have so far.
The DLL uses this header:
// Data structures...
typedef struct
{
    unsigned short width;    // image width in pixels
    unsigned short height;   // image height in pixels
    unsigned int *pixels;    // decoded 32-bit truecolour pixels
} TrueColorBitmap;

// Function prototypes...
int WINAPI LoadJPG(char *fileName, TrueColorBitmap *bmp);


Edit: Nevermind, I found the problem with those red and blue images. Just before the last two blocks, there was a new segment marker (so it wasn't the end of the stream, but it was the end of a segment). I've modified my bitstream reader to handle that situation correctly (or at least, I hope it does now... These two images now decode correctly, and the other test-images seem unaffected).
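For reference, assuming what I hit was a restart marker (RST0..RST7), the handling I mean is roughly this (simplified sketch, not the literal code of my reader):

/* Simplified sketch (not the literal code of my reader) of handling markers
 * inside the entropy-coded data: 0xFF 0x00 is a stuffed literal 0xFF byte,
 * 0xFF 0xD0..0xD7 is a restart marker (after which the decoder should
 * byte-align and reset its DC predictors), anything else ends the scan. */
static int next_entropy_byte(const unsigned char **pp, const unsigned char *end,
                             int *restart_seen)
{
    const unsigned char *p = *pp;

    for (;;) {
        unsigned char b, m;

        if (p >= end)
            return -1;                      /* out of data */

        b = *p++;
        if (b != 0xFF) {
            *pp = p;
            return b;
        }

        m = (p < end) ? *p++ : 0;
        if (m == 0x00) {                    /* byte stuffing: literal 0xFF data byte */
            *pp = p;
            return 0xFF;
        }
        if (m >= 0xD0 && m <= 0xD7) {       /* restart marker: flag it and keep reading */
            *restart_seen = 1;
            continue;
        }
        return -1;                          /* any other marker ends this scan's data */
    }
}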
Posted on 2009-12-19 08:44:13 by Scali

As this page shows, a B-spline filter is much better at blurring blockiness and ringing. Lanczos seems to preserve edges, which is not what you'd want in a blocky and ringy image.


I've used a bilinear filter to scale them back down to their original 100x100 size.
Not sure which one is the best. B-spline has a very smooth, soft look, but it appears to have lost quite a bit of detail.

Another thing I found... The author used IrfanView... on another blog of his I found this quote:
"Finally, the popular freeware IrfanView (Version 3.92) offers a Lanczos resizing option which uses too few sample points and therefore can produce an unwanted shadow pattern in some images. (from enlargingsplugins)".

So perhaps you could get better quality from a Lanczos filter if you used a better implementation than this one.
Attachments:
Posted on 2009-12-20 08:26:27 by Scali
My brother liked my Radeon HD5770 so much that yesterday he bought not one, but two of them.
He had bought a Core i7 860 system recently, but hadn't upgraded his videocard yet, still using a 512 MB GeForce 9800GTX. This card severely bottlenecked the rest of the system, and in most cases my Core2Duo E6600 with Radeon 5770 actually ran games faster and with higher visual quality.

It's the first time I've played around with a multi-GPU setup. There were some problems installing them at first, but those were caused by us, not by the GPUs/drivers themselves. My brother didn't plug in one of the power connectors properly, so it booted with only one card. Then when I noticed why the second card didn't work and plugged it in, Windows installed a standard driver via Windows Update, which didn't match the latest Catalyst drivers that we had already installed for the other card.
After the drivers were installed properly, CrossFireX was immediately enabled, and worked fine in a game of Crysis: Warhead. The framerates were excellent on his 1280x1024 screen, even with everything maxed out, including 8xAA. It was generally in the 35-60 fps region, averaging around 50 fps.

Video playback and his digital cable TV tuner didn't work quite as smoothly. At first, it seemed like the hardware deinterlacing was broken... but perhaps it was because the settings for the nVidia card were still 'stuck'. After meddling with the codec settings, things started to look better.
With the Terratec software for his TV tuner, we saw some really nasty deinterlacing bugs. When you were watching a HD channel in a window, you got huge blocky artifacts... as if the deinterlacing was applied AFTER rescaling rather than BEFORE... or something like that.
As it turned out, it wasn't a codec problem, but the video renderer used (which I suppose is the thing that adds subtitles, OSD and that sort of thing). It was set to Windows Media 9 or something (which my brother claims to have worked fine on nVidia... I wouldn't know, never saw it). When we changed it to the "Enhanced renderer" setting, all looked great.
When the card is working properly, it really does have amazing mpg decoding quality. It looked better than the Samsung set-top box that I use at home, and that one already has very good image quality. The main weakness of the Samsung seems to be in YUV->RGB conversion... There's this channel that has a diamond-shaped red logo in the top-right corner. The diagonal red edges usually have bleeding from the underlying image. We could detect nothing of the sort on the ATi card.
Posted on 2009-12-24 06:29:57 by Scali
Images:
test01_orig.png - the original image. Gradient-filled balls to show the blocking artifacts and sharp crosses/stars to show the ringing artifacts.

test01_75.jpg - image @ Q:75
test01_75.jpg.GDIP.png - the above image decompressed by GDI+

test01_75_down.jpg - image @ Q:75, with subsampling
test01_75_down.jpg.GDIP.png - decompressed by GDI+
test01_75_down.jpg.scali.png - decompressed by Scali's loader

Scali's loader hasn't been tested with images that don't have subsampling because it doesn't support such images ^^'

Q means relative quality; it goes from 0 (very poor image) to 100 (super-high quality).

Anyway, the conclusions:
1. Subsampling further degrades image quality while offering very little difference in file size.
2. GDI+ decompression produces higher quality images than Scali's loader (esp. much less ringing) when the Q factor is very low.
3. Scali's loader produces a little bit darker images.

#2 and #3 suggest that there is a bug in Scali's loader.

The code doesn't use any magic: The image is simply created using Gdiplus::Image::Image(WCHAR*,BOOL) constructor and then saved as PNG with Gdiplus::Image::Save(WCHAR*,CLSID*,EncoderParameters*) method.
Posted on 2009-12-29 21:23:04 by ti_mo_n
Remaining images:

test01_25.jpg - image @ Q:25
test01_25.jpg.GDIP.png - decompressed by GDI+

test01_25_down.jpg - image @ Q:25, with subsampling.
test01_25_down.jpg.GDIP.png - decompressed by GDI+
test01_25_down.jpg.scali.png - decompressed by Scali's loader

test01_0.jpg - image @ Q:0
test01_0.jpg.GDIP.png - decompressed by GDI+

test01_0_down.jpg - image @ Q:0, with subsampling
test01_0_down.jpg.GDIP.png - decompressed by GDI+
test01_0_down.jpg.scali.png - decompressed by Scali's loader
Posted on 2009-12-29 21:24:26 by ti_mo_n
I've noticed #3, but I use the official ITU-R BT.601 coefficients for converting YCbCr to RGB, as defined by the JPG standard. I don't think there's any other way that could explain differences in colours (I use a reference 'bruteforce' IDCT with double precision, so that can't be it).
So is it me who's doing it wrong, or is it GDI+? I have read that especially with MPG/MP4 decoders, the conversion is often not entirely 'correct' for speed considerations. I'm not sure if that also goes for JPG though, and GDI+ in particular.
Unless... it's just darker by a constant factor... perhaps I need to change my rounding mode?
But I've seen some applications that just have totally different shades of red for example, with the rose image. Weird.
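For reference, the conversion I mean is the full-range BT.601/JFIF one. A sketch with explicit rounding (not my literal code); truncating instead of rounding to nearest makes every channel about half a level darker on average, which would fit a constant darkening:

/* Sketch of the full-range BT.601/JFIF YCbCr->RGB conversion with explicit
 * rounding and clamping (not my literal code). */
static unsigned char clamp255(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (unsigned char)(v + 0.5);        /* round to nearest instead of truncating */
}

static void ycbcr_to_rgb(double y, double cb, double cr,
                         unsigned char *r, unsigned char *g, unsigned char *b)
{
    cb -= 128.0;
    cr -= 128.0;
    *r = clamp255(y + 1.402    * cr);
    *g = clamp255(y - 0.344136 * cb - 0.714136 * cr);
    *b = clamp255(y + 1.772    * cb);
}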

As for the 'ringing'.... There may be some additional filtering going on in GDI+?
As I say, mine is a 'reference' implementation, I do everything by the book, and no additional post-processing.
I don't actually see what you mean with 'less ringing' though.
I see the same type of aliasing on both.
The aliasing itself differs slightly... but perhaps GDI+ doesn't use a bilinear filter... it could explain why the ringing has slightly different patterns.
Although... since it seems to be dependent on the quality of compression, it looks more like it's a result of dequantization. After all, the bilinear filter always works the same, regardless of the quality mode. It's dependent on the input from the YCbCr blocks.
Why didn't you just overlay the decoded images and subtract them or such? It would give a better idea of how large the differences are and where they are exactly. See attached.
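Something as simple as this would do (hypothetical sketch, assuming two equally sized 32-bit bitmaps like the TrueColorBitmap from the header I posted earlier):

/* Hypothetical sketch: per-channel absolute difference of two decoded images
 * of equal size, to visualise where two decoders disagree and by how much. */
static void diff_image(const unsigned int *a, const unsigned int *b,
                       unsigned int *out, int width, int height)
{
    int i, shift;
    for (i = 0; i < width * height; i++) {
        unsigned int d = 0;
        for (shift = 0; shift < 24; shift += 8) {
            int ca = (a[i] >> shift) & 0xFF;
            int cb = (b[i] >> shift) & 0xFF;
            int dc = (ca > cb) ? (ca - cb) : (cb - ca);
            d |= (unsigned int)dc << shift;
        }
        out[i] = d | 0xFF000000;    /* force the top byte opaque, in case it is used as alpha */
    }
}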

And yes, I know I currently only support 4:1:1 formats. I hardcoded the bilinear filter. It wouldn't be very difficult to add 4:4:4, but it's an extremely rare format. Theoretically I should support every possible permutation, but I wonder if any JPG decoder really does. Things get quite tricky if you want to support every possible permutation and still deliver good filtering quality.
Posted on 2009-12-30 05:17:45 by Scali
I've changed the rounding on the YCbCr->RGB routine, so all colours should be slightly brighter.
I've also added a simple handler for both 4:1:1 and 1:1:1 formats, so now the non-downsampled JPGs can be decoded as well.

I have extended the bitmap structure to this:
typedef struct
{
    unsigned short width;      // visible image width in pixels
    unsigned short height;     // visible image height in pixels
    unsigned short padWidth;   // width of the decompressed bitmap (scanline stride)
    unsigned short padHeight;  // height of the decompressed bitmap
    unsigned int *pixels;      // decoded 32-bit truecolour pixels, padWidth*padHeight entries
} TrueColorBitmap;


This way it's easier to handle decompression at any width... the decompressed bitmap is always a multiple of 8 in each dimension, depending on the scale factor, and not all pixels at the edges may actually belong to the image. Width and height are the actual image size, while padWidth and padHeight are the dimensions of the decompressed bitmap, which may be larger. So padWidth is the actual width of a scanline in memory.
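In other words (a trivial usage sketch, assuming the struct above):

/* Usage sketch: padWidth is the scanline stride, so pixel (x, y) of the
 * visible image is found at pixels[y * padWidth + x]. */
unsigned int ReadBitmapPixel(const TrueColorBitmap *bmp, int x, int y)
{
    return bmp->pixels[y * bmp->padWidth + x];
}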

Initial tests with the non-subsampled images reveal the same differences in the 'ringing' as the subsampled images... So I think that reinforces my theory that it's not the bilinear filter (these images don't have any filtering applied, since all components have the same sampling frequencies), but rather that either the dequantization stage does something slightly different in GDI+, or there's some kind of additional processing going on in GDI+.
It would appear that the difference is the largest with the blue and black 'star' patterns. These also have the largest contrast differences (white -> blue and white -> black). So it could have something to do with some kind of filtering or gamma correction or such.
It seems to affect the diagonal components in particular... and more in some places than in others. Could relate to the position in the 8x8 block, which would affect the amount of quantization applied...
Attachments:
Posted on 2009-12-30 16:16:34 by Scali
Getting back to the 3D engine... The way I currently dispatch multiple DLLs for different versions of D3D isn't really a good long-term solution.
I'll have to do a proper redesign of it, which sadly is blocking my progress with the rest of the engine.
So I'm warming up to the idea of building a new OpenGL framework, which I was planning for my BHM file format. At least I would be starting from scratch, and that may be a welcome breath of fresh air after tinkering with the D3D engine for such a long time while not really getting anywhere.
Posted on 2010-01-06 08:47:49 by Scali
Well, I'm watching, this affects me, and it's not an issue I've dealt with yet.
Most people can't keep up with us, I've been told so.
I know that game development covers a wide range of programming fields - perhaps more than any other digital pursuit - but that should not discourage novices; the programmer in them should respond in kind!
Posted on 2010-01-06 09:22:40 by Homer
this affects me


Probably not, as the issue I'm not happy with is with the MFC framework I'm using.
Posted on 2010-01-22 06:37:44 by Scali
That's exactly what I thought, and rallied against, creating my own until Biterider decided to write a better one.
The application framework I currently use for dx stuff is essentially his.
It's far better than any of mine.
But it also has some deep-seated issues that need addressing.
I'm just too lazy to do it until someone other than me notices!
Posted on 2010-01-22 08:15:45 by Homer