Yeah that's where I started out in 2021. Been at it for almost 5 years now, last three of which full time. I'm indexing about 1.1 billion documents now off a single server.
Hard part is doing it at any sort of scale and producing useful results. It's easy to build something that indexes a few million documents. Pushing into billions is a bigger challenge, as you start needing a lot of increasingly intricate bespoke solutions.
(... though it operates a bit sub-optimally now as I'm using a ton of CPU cores to migrate the index to use postings lists compression, will take about 4-5 days I think).
I'm mostly living off grants and donations at this point, but the plan down the line is to polish it up well enough to make some money off providing an API like the one Google is making it a hassle to access with this change :-)
Hard part is doing it at any sort of scale and producing useful results. It's easy to build something that indexes a few million documents. Pushing into billions is a bigger challenge, as you start needing a lot of increasingly intricate bespoke solutions.
Devlog here:
https://www.marginalia.nu/tags/search-engine/
And search engine itself:
https://marginalia-search.com/
(... though it operates a bit sub-optimally now as I'm using a ton of CPU cores to migrate the index to use postings lists compression, will take about 4-5 days I think).