Hi! I worked with product quantization in the past, in the context of a library I released to read LLMs stored in the llama.cpp format (GGUF). However, in the context of in-memory HNSWs, I found it to make only a small difference: recall is already almost perfect with int8. Of course it is very different when you are quantizing an actual neural network with, for instance, 4-bit quants. There it makes a huge difference. But in my use case I picked what would be the fastest, given that both performed equally well. What could potentially be done with PQ in the case of Redis Vector Sets is to make 4-bit quants work decently (but still not as well as int8). However, given how fat the data structure nodes are per se, I don't think this is a great tradeoff.
All this to say: the blog post mostly presents the conclusions, but to reach that design many things were tried, including things that looked cooler but in practice were not the best fit. It's not by chance that Redis HNSWs can easily reach 50k full queries/sec on decent hardware.
if you're getting near-perfect recall with int8 and no reranking then you're either testing an unusual dataset or a tiny one, but if it works for you then great!
Near-perfect recall vs fp32, not in absolute terms. TL;DR: it's not int8 that ruins it, at least if the int8 quants are computed per-vector and not with global centroids. Also, recall is a very illusory metric, but this is an argument for another blog post. In short, what really matters is that the best candidates are collected: the long tail is full of elements that are anyway far enough or practically equivalent, and all of this happens under the illusion that the embedding model already captures the similarity our application demands. That is, indeed, already an illusion, so if the 60th result becomes the 72nd, it normally does not matter. The reranking that really matters (if there is the ability to do it) is the LLM picking / reranking: that, yes, makes all the difference.
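To make "recall vs fp32" with per-vector int8 concrete, here is a minimal sketch in plain Python. It is not the Redis Vector Sets code: the function names, the symmetric max-abs scaling, the random Gaussian data, and the brute-force search (instead of an HNSW) are all assumptions chosen for illustration.

```python
import math
import random

def quantize_int8(vec):
    """Per-vector symmetric quantization: each vector gets its own scale,
    derived from its largest absolute component (assumed scheme)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def cosine(a, b):
    """Cosine similarity. For int8 vectors the per-vector scales cancel
    out, so the raw integer components can be used directly."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def top_k(query, items, k):
    """Brute-force nearest neighbors: indices of the k most similar items."""
    order = sorted(range(len(items)), key=lambda i: cosine(query, items[i]), reverse=True)
    return order[:k]

def recall_vs_fp32(n=500, dim=64, queries=20, k=10, seed=0):
    """Average fraction of the fp32 top-k that the int8 top-k also returns."""
    rng = random.Random(seed)
    data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    data_q = [quantize_int8(v)[0] for v in data]
    total = 0.0
    for _ in range(queries):
        q = [rng.gauss(0, 1) for _ in range(dim)]
        exact = set(top_k(q, data, k))                       # fp32 reference
        approx = set(top_k(quantize_int8(q)[0], data_q, k))  # int8 result
        total += len(exact & approx) / k
    return total / queries

if __name__ == "__main__":
    print("recall@10 of int8 vs fp32:", recall_vs_fp32())
```

This only measures how often int8 agrees with the fp32 ranking on synthetic data; it says nothing about absolute recall on a real embedding dataset, which is a separate question.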