If the first 4 bytes of two strings are the same, the strings are likely to be equal. Similarly, the first 4 bytes can be used to determine their relative order most of the time.
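For reference, a minimal C sketch of that trick (the function name and the assumption that both buffers hold at least 4 bytes, e.g. by padding, are mine):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: compare two strings using the first 4 bytes as a
       cheap filter before touching the rest. Assumes both buffers hold at
       least 4 bytes (shorter strings would be zero-padded). */
    int prefixed_compare(const char *a, size_t a_len, const char *b, size_t b_len) {
        int prefix_cmp = memcmp(a, b, 4);     /* settles most comparisons */
        if (prefix_cmp != 0)
            return prefix_cmp;
        size_t min_len = a_len < b_len ? a_len : b_len;
        int full_cmp = memcmp(a, b, min_len); /* prefixes match: rare slow path */
        if (full_cmp != 0)
            return full_cmp;
        return (a_len > b_len) - (a_len < b_len);
    }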
Tangential: How do you guys decide which similarity metric to use? Euclidean seems the most intuitive but I'm sure others also have use cases. This situation is similar to the similarity metrics used to compare two distributions (KL for instance, tho technically it's not a metric). Is there some art to it or is there really a systematic way to choose metrics for different tasks?
This is actually a really underappreciated question! (and also really interesting!) There's a lot of nuance to this, because the truth is that distance becomes less meaningful as you increase dimensions. You can find papers comparing Lp distances with different p values (p=2 == Euclidean == L2). But as dimension increases, the distances to the nearest and the furthest points converge toward each other (making it harder to differentiate near points from far points). Cosine similarity is a commonly used one, but as dimensions increase the likelihood that any two vectors are nearly orthogonal rapidly increases. This might seem counterintuitive because the probability is really low in 2 or 3 dims, where the directions orthogonal to a given vector form just a line in R2 or a plane in R3.
So the answer honestly tends to be ad hoc: "whatever works best". It's good to keep in mind that any intuition you have about geometry goes out the window as dimensions increase, so it's important to remember what assumptions you're making, especially when relying on empiricism. There are definitely nuances that can point you in better directions (pun intended :) than random guessing, especially if you know a lot about your data's geometry, but it is messy and small nuances can make big differences.
I wish I had a better answer but I hope this is informative. Maybe some mathematician will show up and add more. I'm sure there's someone on HN that loves to talk about higher dimensional geometry and I'd love to hear those "rants."
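To make the near-vs-far point concrete, here's a throwaway C demo (my own toy code, not from any library): sample random points in [0,1]^d and watch the nearest and furthest L2 distances from a query converge as d grows.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double l2(const double *a, const double *b, int d) {
        double s = 0;
        for (int i = 0; i < d; i++) { double t = a[i] - b[i]; s += t * t; }
        return sqrt(s);
    }

    int main(void) {
        srand(42);
        int dims[] = {2, 10, 100, 1000, 10000};
        int n = 1000; /* points sampled per dimension */
        for (int k = 0; k < 5; k++) {
            int d = dims[k];
            double *q = malloc(d * sizeof *q);
            double *p = malloc(d * sizeof *p);
            for (int i = 0; i < d; i++) q[i] = (double)rand() / RAND_MAX;
            double dmin = INFINITY, dmax = 0;
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < d; i++) p[i] = (double)rand() / RAND_MAX;
                double dist = l2(q, p, d);
                if (dist < dmin) dmin = dist;
                if (dist > dmax) dmax = dist;
            }
            /* As d grows, min/max creeps toward 1: "near" and "far" blur together. */
            printf("d=%5d  min=%8.3f  max=%8.3f  min/max=%.3f\n", d, dmin, dmax, dmin / dmax);
            free(q); free(p);
        }
        return 0;
    }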
"whatever works best" because it also totally depends on your field of application. Here[0] is for example a similarity search using the Tanimoto[1] index.
The "works best" is also, in many cases, subjective. This is also not easy to assess, you may need several people looking at the results, here molecules, to say if yes they are similar or not. A chemist will think differently than a biologist in this regard.
L2 in low dims, Cosine in high dims. Hamming on same-size sets, Jaccard on others. Jensen-Shannon (symmetric KL) when dealing with histograms and probability distributions.
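For the last one, a rough sketch of Jensen-Shannon on already-normalized histograms (my own helper, not from any particular library):

    #include <math.h>
    #include <stddef.h>

    /* Jensen-Shannon divergence between two histograms. Assumes both sum
       to 1; the 0 * log 0 terms are treated as 0. */
    double jensen_shannon(const double *p, const double *q, size_t n) {
        double js = 0.0;
        for (size_t i = 0; i < n; i++) {
            double m = 0.5 * (p[i] + q[i]);
            if (p[i] > 0.0) js += 0.5 * p[i] * log(p[i] / m); /* 0.5 * KL(P || M) */
            if (q[i] > 0.0) js += 0.5 * q[i] * log(q[i] / m); /* 0.5 * KL(Q || M) */
        }
        return js; /* in nats; symmetric and bounded by log(2) */
    }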
Cosine similarity is certainly not immune to the curse of dimensionality. In fact, it is explicitly prone to it. As dimensions increase, the likelihood that any two vectors are nearly orthogonal rapidly increases. This is easy to reason out if you start from 2 dimensions, count the directions orthogonal to a given vector, then move to 3D, and so on.
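You can also see it numerically with a few lines of throwaway C (my own toy code): the average |cosine| between pairs of random vectors collapses toward 0 as the dimension grows.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static double unit_rand(void) { return 2.0 * rand() / RAND_MAX - 1.0; } /* uniform in [-1, 1] */

    int main(void) {
        srand(7);
        int dims[] = {2, 3, 10, 100, 1000, 10000};
        for (int k = 0; k < 6; k++) {
            int d = dims[k], pairs = 200;
            double mean_abs_cos = 0.0;
            for (int t = 0; t < pairs; t++) {
                double dot = 0, na = 0, nb = 0;
                for (int i = 0; i < d; i++) {
                    double a = unit_rand(), b = unit_rand();
                    dot += a * b; na += a * a; nb += b * b;
                }
                mean_abs_cos += fabs(dot / sqrt(na * nb));
            }
            /* Random pairs look more and more orthogonal as d grows. */
            printf("d=%5d  mean |cos| = %.3f\n", d, mean_abs_cos / pairs);
        }
        return 0;
    }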
SimSIMD is already used in USearch, ClickHouse, and Lantern. In other solutions SIMD is generally limited to pre-2015 Assembly instructions. So yes, there are a lot of RAG workflows to accelerate :)
As far as I'm aware pg_vector just uses the compiler's autovectorization on float32. I think specifically for Euclidean distance you won't really beat GCC (the README even admits it: "GCC handles single-precision float but might not be the best choice for int8 and _Float16 arrays, which has been part of the C language since 2011.")
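For context, this is roughly the shape of loop that claim is about (not pgvector's actual source, just a sketch). With -O3 plus -ffast-math (or -fassociative-math, needed so GCC may reorder the float accumulation) it compiles to packed SIMD on AVX/NEON targets, which is why hand-written intrinsics buy little for the float32 L2 case.

    /* Not pgvector's code, just the kind of loop compilers autovectorize well.
       The float reduction is only vectorized under relaxed FP semantics,
       since vectorizing it reorders the additions. */
    float l2_squared_f32(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }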
If I remember correctly, it uses BLAS libraries for the dot product, while Cosine distance and L2 have only very minimal SIMD support. In both cases SimSIMD should be faster for distance computations.