I tried that. It's an improvement, but libtommath is still much slower than the other two.
digits = 100: 4.31597e-05 s
digits = 1000: 0.000249898 s
digits = 10000: 0.00252304 s
digits = 100000: 0.0618554 s
digits = 1000000: 5.24447 s
digits = 10000000: 703.01 s
He certainly has only basecase division, which explains the original benchmark timings.
I'm not very familiar with libtommath, though I had passing experience with it back around 2004-2005. Looking through the code, however, he does seem to have toom3. (I have both toom3 and toom32, which may or may not be relevant here; the latter is for unbalanced multiplication.)
More obviously, libtommath seems to check for errors after every operation, even internally. I take this to be a kind of error-propagation mechanism, standing in for exceptions, and it costs a branch per operation on hot paths.
I also recall there being a libtomcrypt at some point; maybe it still exists. That suggests Tom is possibly focusing on the much more difficult area of crypto, where your code needs to be much, much more defensive.
Also, libtommath claims to be 100% C, which bsdnt is not. We use some assembly language, which gives us a 5-30% speedup (we could get roughly another factor of 2 by unrolling the loops, as the C compiler we are comparing against does).
Those are a few of the things I can see.
But performance isn't everything. It has never been my intention to compete with GMP performance-wise, for example; you simply cannot beat GMP without being as technical as GMP. The focus here is simplicity and reliability (though not to the extremes required for crypto), and it always will be.
Roughly, my goals for this project were to eventually be much faster than, say, Python bignums -- maybe within a factor of 2 or so of GMP on generic problems -- but with code that could be maintained by language designers themselves, without their being bignum experts.
Thanks for the clear explanation, and great job on writing a high-performing bignum library. From these benchmarks, it looks like the go-to choice for a high-performing, permissively licensed bignum library.