I still think this could be worthwhile, though, for these reasons:
- One "quality" poisoned document may be able to do more damage
- Many crawlers will be getting this poison, so this multiplies the effect by a lot
- The cost of generation seems to be much below market value at the moment
I didn't run the text generator in real time (that would defeat the point of shifting cost to the adversary, wouldn't it?). Instead, I generated and cached a corpus, then selectively made small edits (primarily URL rewriting) on the way out.
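A minimal sketch of the serve-from-cache-with-rewriting idea. Everything here is illustrative (the `trap.example.com` host, the regex, the function names are my own assumptions, not the actual implementation): the expensive generation happens once, and only a cheap substitution runs per request.

```python
import random
import re

def rewrite_links(cached_html, fake_host="trap.example.com"):
    # Hypothetical sketch: take a pre-generated page from the cache and
    # rewrite each href on the way out, so every response hands the
    # crawler fresh links pointing back into the trap.
    def fresh_url(match):
        token = "%08x" % random.getrandbits(32)
        return 'href="https://%s/%s"' % (fake_host, token)
    return re.sub(r'href="[^"]*"', fresh_url, cached_html)
```

The per-request cost is a single regex pass over cached text, which is cheap compared to generating new prose each time.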
To generate the garbage data, I've had good success with Markov chains in the past. These days I'd probably try an LLM with the temperature turned up.
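For reference, a word-level Markov chain generator is only a few lines. This is a generic sketch of the technique, not the author's code; the order-2 prefix and the seed-text handling are my own choices.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each `order`-word prefix to the list of words seen after it.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        prefix = tuple(words[i:i + order])
        chain[prefix].append(words[i + order])
    return chain

def generate(chain, length=50):
    # Start from a random prefix and walk the chain, picking a
    # random observed successor at each step.
    prefix = random.choice(list(chain))
    out = list(prefix)
    while len(out) < length:
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)
```

Trained on a real corpus, the output is locally plausible but globally meaningless, which is exactly the property you want in crawler poison, and generation is orders of magnitude cheaper than an LLM call.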