This class of error is called (by me, at least) a "contoot" because, long ago, when I was writing the JBIG2 compressor for Google Books PDFs, the first example was on the contents page of a book. The title, "Contents", was set in very heavy type, which happened to be an unexpected edge case in the classifier: it matched the "o" with the "e" and "n" and output "Contoots".
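To make the failure mode concrete, here is a toy sketch of threshold-based symbol matching. It is not the Google Books classifier or any real JBIG2 encoder; the 5x5 glyphs, the `hamming` and `classify` helpers, and the `max_diff` threshold are all made up for illustration. It only shows how a tolerance that is too loose lets a heavy "e" be replaced by an already-seen "o".

```python
from typing import List

# A bitmap is a list of equal-length rows: '#' is ink, '.' is paper.
Bitmap = List[str]

# Hypothetical heavy-type glyphs, invented for this example.
GLYPH_O: Bitmap = [
    ".###.",
    "#...#",
    "#...#",
    "#...#",
    ".###.",
]
GLYPH_E: Bitmap = [
    ".###.",
    "#...#",
    "#####",
    "#....",
    ".###.",
]


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyphs: List[Bitmap], max_diff: int) -> List[int]:
    """Map each glyph to the first dictionary symbol within max_diff pixels.

    A glyph that matches nothing becomes a new dictionary symbol.
    The returned list holds one symbol index per input glyph.
    """
    dictionary: List[Bitmap] = []
    ids: List[int] = []
    for glyph in glyphs:
        for index, symbol in enumerate(dictionary):
            if hamming(glyph, symbol) <= max_diff:
                ids.append(index)  # reuse the existing symbol (lossy)
                break
        else:
            dictionary.append(glyph)
            ids.append(len(dictionary) - 1)
    return ids


loose = classify([GLYPH_O, GLYPH_E, GLYPH_O], max_diff=4)
strict = classify([GLYPH_O, GLYPH_E, GLYPH_O], max_diff=2)
print(loose, strict)
```

With the loose threshold the "e" is emitted as a copy of the "o"; with the strict one it gets its own symbol, at the cost of a larger dictionary.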
The classifier was adjusted and these errors mostly went away. It certainly seems that Xerox have configured things incorrectly here.
Also, with Google Books, we held the hi-res original images. It's not like the PDF downloads were copies of record. We could also tweak the classification and regenerate all the PDFs from the originals.
For a scanner, I don't think that symbol compression should be used at all for this reason. For a single page, JBIG2 generic region encoding is generally just as good as symbol compression.
How would one handle the case with the tiny boxes? It seems to me that these ought to be treated more like line drawings and not unified as symbols at all if you can't properly decompose them into lines of Latin alphabet glyphs. JBIG2 of course cleverly doesn't tell you how to do the "smart" segmentation...
Actually, that doesn't matter all that much. You ought to scan it into a TIFF file and then process it the way you want. If you want a good JBIG2 compressor to your liking, you have to write it yourself anyway; I don't think that the printer hardware and software are up to that task.
The idea is actually very smart: given the infinite (and multidimensional) space of encoder solutions, the standard fixes only the bit encoding and the decompression process. It's like with PDF: it's well defined how to draw it into a bitmap, but you're not constrained as to how you generate the layout, what line break algorithm you use, etc.
The title, "Contents", was set in very heavy type which happened to be an unexpected edge case in the classifier and it matched the "o" with the "e" and "n" and output "Contoots".
Wouldn't it be a good idea to perform OCR - using a language model, the works - before you start classifying the JBIG2 symbols? That way, you'd have additional contextual information to say "Aha, 'contoots' is probably not what it reads here" at least in some of the cases.
Although, I realize that on "Google scale", such a complex solution could be a problem.
A language model would give you the opposite problem, e.g. you scan a print of _this_ page containing the word "contoots", which your language model corrects to "contents"...
JBIG2 [91] suggests using OCR to verify that you didn't mangle anything. If the compressed result has a lower success rate in matching words than the original, then you did something wrong.
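A minimal sketch of that check, assuming you already have the OCR output of both the original scan and the decompressed result as word lists. The `match_rate` and `compression_looks_safe` helpers and the tiny lexicon are hypothetical, and the OCR step itself is left out.

```python
from typing import Iterable, List, Set


def match_rate(words: List[str], lexicon: Set[str]) -> float:
    """Fraction of OCR'd words found in the lexicon (case-insensitive)."""
    if not words:
        return 0.0
    hits = sum(word.lower() in lexicon for word in words)
    return hits / len(words)


def compression_looks_safe(
    original_words: List[str],
    compressed_words: List[str],
    lexicon: Iterable[str],
) -> bool:
    """True if the compressed page matches words at least as often as the original."""
    known = {word.lower() for word in lexicon}
    return match_rate(compressed_words, known) >= match_rate(original_words, known)


# Made-up OCR output for a contents page, before and after compression.
lexicon = ["contents", "chapter", "index"]
original = ["Contents", "Chapter", "Index"]
compressed = ["Contoots", "Chapter", "Index"]

print(compression_looks_safe(original, compressed, lexicon))
```

Because this compares rates rather than correcting the text, a page that really does say "contoots" scores the same before and after compression and is not flagged.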
More than you want to know about this topic can be found here: https://www.imperialviolet.org/binary/google-books-pdf.pdf