
My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.


Oh, you will not notice. The page requests can easily be spread out across residential IPs using headless browsers (masked as real ones); unless you really pay attention, you won't see the ones that want to hide.


Every single argument against Cloudflare's features highlights exactly why people use Cloudflare so much.

You're talking about people setting up a botnet in order to scrape every scrap of data they can off of every website they touch. Why on earth would anyone be okay with such parasitic behaviors?


That's the thing: CF ain't gonna protect you against that. You need actual access controls if you want to restrict access.

Otherwise you're blaming people for using the data you've published; so what if they do?


How many scrapers are sophisticated enough to go this far, though? Most of them are probably low quality and can be detected.


Why would those sophisticated enough to go that far be of low quality?


Unless they are scraping it using residential botnet proxies, unique user agents, unique device types, etc.


How often are the bots indexing it?


If you listen to the people complaining about bots at the moment, some bots are scraping the same pages over and over to the tune of terabytes per day because the bot operators have unlimited money and their targets don't.


> because the bot operators have unlimited money

I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus the discounts they get for being big customers).


> I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus the discounts they get for being big customers).

Seems like there's the potential to take advantage of this for a semi-custom protocol, if there's a desire to balance costs for serving data while still making things available to end users. We'd have the server reply to the initial request with a new HTTP response instructing the client to re-request with a POST containing an N-byte (N = data size) one-time pad. The client receives this, generates random data (or all zeros, up to the client) and POSTs it; the server then sends the actual response XOR'd with the one-time pad.

Upside: Most end users don't pay for upload; if bot operators do, this incurs a dollar cost only to them. Downside: Increased download cost for the web site operator (but we've postulated that this is small compared to upload cost), extra round trip, extra time for each request (especially for end users with asymmetric bandwidth).
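To make the round trip concrete, here's a minimal sketch of the exchange with both sides written as plain Python functions instead of a real HTTP stack; the 449 status code, field names, and helpers (handle_get, handle_post, fetch) are all invented for illustration, not part of any real protocol.

    # Sketch only: the pad exchange simulated in memory, not over real HTTP.
    import os

    RESOURCE = b"<html>... the actual page body ...</html>"

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # --- server side ---
    def handle_get(path: str) -> dict:
        # Step 1: answer the initial GET with an instruction to re-request
        # via POST, stating how many pad bytes the client must upload.
        return {"status": 449, "retry_with_pad_bytes": len(RESOURCE)}

    def handle_post(path: str, pad: bytes) -> dict:
        # Step 2: the client uploaded an N-byte pad; send back the
        # resource XOR'd with that pad.
        assert len(pad) == len(RESOURCE), "pad must match resource length"
        return {"status": 200, "body": xor_bytes(RESOURCE, pad)}

    # --- client side ---
    def fetch(path: str) -> bytes:
        first = handle_get(path)
        n = first["retry_with_pad_bytes"]
        pad = os.urandom(n)  # could be all zeros; the upload cost is the point
        second = handle_post(path, pad)
        return xor_bytes(second["body"], pad)  # undo the XOR to recover the page

    if __name__ == "__main__":
        assert fetch("/some/page") == RESOURCE
        print("client recovered the page after uploading", len(RESOURCE), "pad bytes")

The only real work here is the XOR; the point is that the client cannot get the page without first uploading as many bytes as it wants to download.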

Eh, just a thought.


It may work for small pages, like most of my web pages apart from some downloadable files, but megabytes of JavaScript on an average (mobile?) connection are going to take significantly longer to load, cost more battery, and take twice as much from your data bundle.

Perhaps it's effective as a bot deterrent when someone incurs, say, ten times the median load (measured in something like CPU time per hour or bandwidth per week). It will not prevent anyone from seeing your pages, so information is still free, but it levels the playing field -- at least for those with free inbound bandwidth dealing with bots that pay for outgoing bandwidth.
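A rough sketch of what that trigger might look like, assuming per-client byte counters and a made-up require_pad check that switches a client over to the pad scheme once it exceeds ten times the median load; none of this is a real API.

    # Hypothetical sketch: decide per client whether to demand the pad exchange.
    # The counters, the window, and the 10x-median threshold are assumptions.
    from collections import defaultdict
    from statistics import median

    bytes_served = defaultdict(int)  # client id (e.g. IP) -> bytes in current window

    def record(client: str, n: int) -> None:
        bytes_served[client] += n

    def require_pad(client: str, factor: float = 10.0) -> bool:
        # True if this client's load is more than `factor` times the median load.
        if len(bytes_served) < 2:
            return False
        med = median(bytes_served.values())
        return med > 0 and bytes_served[client] > factor * med

    # Example: one client pulls far more than the rest.
    for ip in ("1.2.3.4", "5.6.7.8", "9.9.9.9"):
        record(ip, 50_000)
    record("44.44.44.44", 5_000_000)
    print(require_pad("44.44.44.44"))  # True: gets the pad requirement
    print(require_pad("1.2.3.4"))      # False: served normally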


> because the bot operators have unlimited money and their targets don't.

wget/curl vs django/rails, who wins?



