This class of error is called (by me, at least) a "contoot" because, long ago, w...

gngeal · on Aug 4, 2013

How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings and not unified as symbols at all if they can't be properly decomposed into lines of Latin alphabet glyphs. JBIG2 of course cleverly doesn't tell you how to do the "smart" segmentation...

linohh · on Aug 5, 2013

Yeah, and because the libraries are not open source, we'll never be able to check who failed big time.

gngeal · on Aug 5, 2013

Actually, that doesn't matter all that much. You ought to scan it into a TIFF file and then process it the way you want. If you want a good JBIG2 compressor to your liking, you have to write it yourself anyway; I don't think the printer hardware and software are up to that task.

The idea is actually very smart: given the infinite (and multidimensional) space of encoder solutions, the standard fixes only the bit encoding and the decompression process. It's like with PDF: how to draw it into a bitmap is well defined, but you're not constrained as to how you generate the layout, what line-break algorithm you use, etc.

gngeal · on Aug 5, 2013

It just occurred to me...

The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".

Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.

Although, I realize that on "Google scale", such a complex solution could be a problem.

cmarschner · on Aug 5, 2013

A language model would give you the opposite problem - e.g. you scan a print of _this_ page containing the word "contoots", which your language model corrects to "contents"...
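
To make that failure mode concrete, here is a minimal sketch. It stands in for a language model with a plain dictionary lookup from Python's `difflib`; the vocabulary and the `correct` helper are made up for illustration and are not anything a real scanner ships with.

```python
import difflib

# Hypothetical vocabulary; a real language model would be far larger.
VOCABULARY = ["contents", "context", "compress"]

def correct(word, vocabulary=VOCABULARY):
    """Snap a word to its closest vocabulary entry, if any is close enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1)
    return matches[0] if matches else word

# The page really does say "contoots", but the corrector "fixes" it anyway.
print(correct("contoots"))
```

The corrector cannot tell a scan error from text that was meant to read that way, so it silently changes a faithful scan.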

ygra · on Aug 5, 2013

JBIG2 [91] suggests using OCR to verify that you didn't mangle anything. If the compressed result has a lower success rate in matching words than the original, then you did something wrong.

[91] http://jbig2.com/
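
A rough sketch of that check, assuming the OCR step has already been run on both the original scan and the decompressed result and has handed back plain word lists. The function names, the lexicon, and the `tolerance` parameter are all hypothetical; no real OCR or JBIG2 library is called here.

```python
def word_match_rate(words, lexicon):
    """Fraction of OCR'd words that appear in the lexicon (case-insensitive)."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in lexicon)
    return hits / len(words)

def compression_looks_safe(original_words, compressed_words, lexicon, tolerance=0.0):
    """True if the compressed page matches the lexicon at least as well as the original."""
    before = word_match_rate(original_words, lexicon)
    after = word_match_rate(compressed_words, lexicon)
    return after >= before - tolerance

# Hypothetical OCR output for the same page before and after compression.
lexicon = {"table", "of", "contents"}
original = ["Table", "of", "Contents"]
compressed = ["Table", "of", "Contoots"]

if not compression_looks_safe(original, compressed, lexicon):
    print("match rate dropped; fall back to a lossless encoding for this page")
```

Because the check compares the two rates instead of correcting words, a page that genuinely contains "contoots" scores the same before and after and is not flagged.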