PageRank was an innovative idea in the early days of the Internet when trust was high, but yes it's absolutely gamed now and I would be surprised if Google still relies on it.
Fair play to them though, it enabled them to build a massive business.
Though I'd think that you'd want to weight unaffiliated sites' anchor text to a given URL much higher than an affiliated site.
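To make that concrete, here's a rough sketch of what I have in mind. The `owner` labels, the function name and the two weights are all made up for illustration; deciding who actually counts as "affiliated" is the hard part and isn't solved here.

```python
from collections import defaultdict


def anchor_term_weights(links, target_owner,
                        unaffiliated_weight=1.0, affiliated_weight=0.25):
    """Sum per-term weights over the anchor texts pointing at one URL.

    `links` is an iterable of (source_owner, anchor_text) pairs. The owner
    labels are hypothetical: some upstream step would have to assign them.
    """
    weights = defaultdict(float)
    for source_owner, anchor_text in links:
        # Links from the target's own owner count for much less.
        if source_owner == target_owner:
            weight = affiliated_weight
        else:
            weight = unaffiliated_weight
        for term in anchor_text.lower().split():
            weights[term] += weight
    return dict(weights)


EXAMPLE_LINKS = [
    ("acme", "best widgets"),          # self-promotion
    ("indie-blog", "widgets review"),  # independent site
    ("acme", "best widgets"),          # self-promotion again
]
```

With these numbers, "review" from a single independent link ends up outweighing "best" repeated twice by the affiliated owner.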
"Affiliation" is a tricky term itself. Content farms were popular in the aughts (though they seem to have largely subsided), as were adware firms such as Claria (formerly Gator). There are chumboxes (Outbrain, Taboola), and of course affiliate links (e.g., to Amazon or other shopping sites). SEO manipulation is its own whole universe.
(I'm sure you know far more about this than I do, I'm mostly talking at other readers, and maybe hoping to glean some more wisdom from you ;-)
Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than PageRank, but in isolation, it's not sufficient.
I've also seen some benefit from fingerprinting the network traffic the websites make, using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.
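Roughly, the matching step could look like the sketch below. It assumes the request URLs have already been recorded by the headless browser, and the domain table is a tiny illustrative sample rather than the list actually used; `ad_networks` is a made-up name.

```python
from urllib.parse import urlsplit

# Illustrative sample only; a real table would be far longer.
AD_NETWORK_DOMAINS = {
    "doubleclick.net": "Google",
    "googlesyndication.com": "Google",
    "taboola.com": "Taboola",
    "outbrain.com": "Outbrain",
}


def ad_networks(request_urls):
    """Return the set of ad networks whose domains appear in the traffic."""
    found = set()
    for url in request_urls:
        host = (urlsplit(url).hostname or "").lower()
        for domain, network in AD_NETWORK_DOMAINS.items():
            # Match the domain itself or any subdomain, but not lookalikes
            # such as "nottaboola.com".
            if host == domain or host.endswith("." + domain):
                found.add(network)
    return found


RECORDED = [
    "https://cdn.taboola.com/libtrc/loader.js",
    "https://securepubads.g.doubleclick.net/tag/js/gpt.js",
    "https://example.com/style.css",
]
```

The resulting set can then be used as one feature among several, not as a verdict on its own.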
The full data set of DOM samples + recorded network traffic is in an enormous sqlite file (400GB+), and I haven't yet worked out any way of distributing the data. Though it's in the back of my mind as something I'd like to solve.
I'd also suspect that there are networks / links which are more likely signs of low-value content than others. Off the top of my head, crypto, MLM, known scam/fraud sites, and perhaps share links to certain social networks might be negative indicators.
You can actually identify clusters of websites based on the cosine similarity of their outbound links. Pretty useful for identifying content farms spanning multiple websites.
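A minimal sketch of the idea, assuming each site's outbound links have already been counted per target domain. The 0.8 threshold, the single-link grouping via union-find and the example domains are my own picks for illustration, not necessarily what a production crawler would use.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two outbound-link count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_sites(outlinks, threshold=0.8):
    """Group sites whose outbound-link profiles are pairwise similar.

    Two sites land in the same cluster if a chain of pairs with
    similarity >= threshold connects them. Singletons are dropped.
    """
    sites = sorted(outlinks)
    parent = {s: s for s in sites}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # O(n^2) comparison; fine for a sketch, too slow for a whole index.
    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if cosine_similarity(outlinks[a], outlinks[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for s in sites:
        groups.setdefault(find(s), set()).add(s)
    return [g for g in groups.values() if len(g) > 1]


EXAMPLE = {
    "farm-a.example": Counter({"casino.example": 5, "pills.example": 3}),
    "farm-b.example": Counter({"casino.example": 4, "pills.example": 3}),
    "blog.example": Counter({"wikipedia.org": 2, "python.org": 1}),
}
```

Here the two farm sites link to the same targets in similar proportions, so they cluster together, while the blog shares no targets with them and is left out.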
Google’s biggest search signal now is aggregate behavioral data reported from Chrome. That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.
It’s also why it is so hard to compete with Google. You guys are talking about techniques for analyzing the corpus of the search index. Google does that and has a direct view into how millions of people interact with it.
Yes indeed, they have an impossibly deep moat and deeper pockets. I'm certainly not trying to compete with them with my little side project, it's just for fun!
> That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.
There is a native Chrome app on iOS. It gets all the same URL visit data as Chrome on other platforms.
Apple blocks third-party renderers and JS engines on iOS to protect its App Store from competition that might deliver software and content through other channels that it doesn't take a cut of.