> Apart from search, ANNs can be use for recommendations, classification, and other information retrieval problems.
Yeah, those are all good use cases. I was wondering about a different thing: will the demand for a distributed vector search service concentrate to a few big companies, as smaller companies can use a simpler solution so they don't really need to pay for the technology.
> Currently, ES and Solr, both based on Lucene, can't really manage vector representations, as they are mainly based on inverted indexes to n-grams.
ES has kNN plugin, which stores vectors separately in each segment in Lucene index. Plus, they can also use better storage formats and algorithms.
I guess it depends on what you mean by "simple". The algorithms are complex but there are good tools that implement them. I would imagine smaller companies would use off the shelf tooling, and I would argue that is simpler. Vector embeddings are so unbelievably powerful and often yield better results than classical methods with one of the good tools + pretrained embeddings.
Specifically for search, I use them to completely replace stemming, synonyms, etc in ES. I match the query's embedding to the document embeddings, find the top 1000 or so. Then I ask ES for the BM25 score for that top 1000. I combine the embedding match score with BM25, recency, etc for final rank. The results are so much better than using stemming, etc and it's overall simpler because I can use off the shelf tooling and the data pipeline is simpler.
> I match the query's embedding to the document embeddings,
I assume the doc size is relatively small, otherwise a document may contain too many different topics that make it hard to differentiate different queries?
For my search use case, documents are mostly single topic and less than 10 pages. However I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.
The moment you cross 10M items or 100 QPS then scaling such a system becomes non-trivial. That's not a high threshold for any enterprise software company handling customer data or any consumer tech company with >10M users. Once you add other requirements to the mix, such as index freshness and metadata filtering, the managed options where this is already built-in start to become compelling even at lower volumes.
Also, Pinecone (disclosure: I work there) has usage-based billing that starts at $72/month, so "paying for the technology" is not that scary.
Yeah, those are all good use cases. I was wondering about a different thing: will the demand for a distributed vector search service concentrate to a few big companies, as smaller companies can use a simpler solution so they don't really need to pay for the technology.
> Currently, ES and Solr, both based on Lucene, can't really manage vector representations, as they are mainly based on inverted indexes to n-grams.
ES has kNN plugin, which stores vectors separately in each segment in Lucene index. Plus, they can also use better storage formats and algorithms.