Hacker News

It is indeed a huge waste to re-scrape an entire site looking for changes and new content. If Cloudflare could maintain an overview of changes and updates, it could save a lot of resources.

The site could tell Cloudflare directly what changed, and Cloudflare could tell the AI. The AI pays for the changes, Cloudflare passes the payment on to the site, and keeps a margin.



The sitemap.xml spec already has fields for indicating the last time a page was changed and how often it's expected to change in the future, so that search engines can optimize their updates accordingly, but AI scrapers tend to disregard that and just download the same unchanged page 10,000 times for the hell of it.
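For reference, a minimal sitemap.xml entry using those fields might look like this (the URL and values are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post-42</loc>
    <!-- last time this page changed -->
    <lastmod>2024-05-01</lastmod>
    <!-- how often it's expected to change -->
    <changefreq>monthly</changefreq>
    <priority>0.6</priority>
  </url>
</urlset>
```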


> sitemap.xml spec already has fields for indicating the last time a page was changed

I did not know that bit! I'm considering adding this to my site now, because it sounds like it would save a lot of resources for everyone. Do (m)any crawlers use this information in your experience?


https://developers.google.com/search/docs/crawling-indexing/...

Google ignores the priority and change-frequency fields, but they do use the last-modified field to skip pulling pages that haven't changed since their crawler last visited. Not sure exactly which signals Bing uses, but they definitely use last-modified as well.
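The crawler-side logic described here is simple enough to sketch. This is a hypothetical illustration (the helper name and sample URLs are made up, not any real crawler's code): parse the sitemap's `<lastmod>` values and only re-fetch URLs modified since the last crawl, recrawling any URL that gives no signal.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_to_recrawl(sitemap_xml: str, last_crawl: datetime) -> list[str]:
    """Return URLs whose <lastmod> is after last_crawl (or missing)."""
    root = ET.fromstring(sitemap_xml)
    due = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            due.append(loc)  # no signal: recrawl to be safe
            continue
        modified = datetime.fromisoformat(lastmod)
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            due.append(loc)
    return due

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-06-01</lastmod></url>
</urlset>"""

# Only /b changed after the March crawl, so only /b is re-fetched.
print(urls_to_recrawl(sitemap, datetime(2024, 3, 1, tzinfo=timezone.utc)))
# → ['https://example.com/b']
```

A well-behaved scraper doing this would hit each unchanged page zero times instead of 10,000.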



