Collaborative filtering is a boring problem and doesn't get to the heart of what's wrong with Reddit, Hacker News, and such.
For one thing, many good stories languish on the "new" page and never get enough votes to get a fair shake. Collaborative filtering doesn't help with this; if anything, it makes it worse.
Last night I made a crude boomerang by gluing two rulers together. This morning the glue had set, and my son pressured me to try throwing it before I'd even finished my breakfast. Right when it started to curve, it hit a telephone pole and broke at the glue joint.
When I see many of the things people want to do on Reddit, my first impression is that they will wind up like that. For instance, LSI (latent semantic indexing) is one of those things that does not work so well in real life... They still seem to be teaching kids about it, but not that you get results almost as good doing dimensionality reduction with a random basis set.
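To make the random-basis point concrete, here is a minimal sketch of dimensionality reduction by Gaussian random projection, in plain Python. The document vectors, the sizes `D` and `K`, and the helper names are all made up for illustration. It only checks that pairwise distances roughly survive the projection; it does not compare retrieval quality against LSI.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Return a k x d matrix of Gaussian entries, scaled by 1/sqrt(k)
    so that squared lengths are preserved in expectation."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, vec):
    """Multiply the projection matrix by a d-dimensional vector."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
D, K = 500, 200  # hypothetical vocabulary size and reduced dimension

# Hypothetical sparse "documents": 20 random terms each, with counts 1..3.
docs = []
for _ in range(5):
    vec = [0.0] * D
    for idx in rng.sample(range(D), 20):
        vec[idx] = float(rng.randint(1, 3))
    docs.append(vec)

R = random_projection_matrix(D, K, rng)
reduced = [project(R, v) for v in docs]

# Ratio of projected distance to original distance for every pair of documents.
ratios = [
    distance(reduced[i], reduced[j]) / distance(docs[i], docs[j])
    for i in range(5)
    for j in range(i + 1, 5)
]
print(min(ratios), max(ratios))
```

The ratios should cluster around 1, and no SVD is needed to get there, which is the appeal of a random basis.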
If you've got some semantic analysis and predictive models, you can make an automated system that picks quality, relevant content out of the "new" queue. Because you can use smart feature selection, you don't need to wrangle as much data: training is orders of magnitude faster and you don't need to futz around with Hadoop.
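A rough sketch of what that could look like is below. It is a toy, not the system described above: the titles and labels are invented, bare title tokens stand in for real semantic analysis, and the model is a simplified naive Bayes that only looks at the selected tokens present in a title. Every function name here is hypothetical.

```python
import math
from collections import Counter


def tokenize(title):
    """Lowercase a title and keep purely alphabetic tokens."""
    return [w for w in title.lower().split() if w.isalpha()]


def select_features(examples, top_k):
    """Keep the top_k tokens whose document frequency differs most
    between good and bad examples (ties broken alphabetically)."""
    good, bad = Counter(), Counter()
    for title, label in examples:
        (good if label else bad).update(set(tokenize(title)))
    vocab = set(good) | set(bad)
    ranked = sorted(vocab, key=lambda w: (-abs(good[w] - bad[w]), w))
    return set(ranked[:top_k])


def train(examples, features):
    """Count, per class, how many documents contain each selected token."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for title, label in examples:
        docs[label] += 1
        counts[label].update(set(tokenize(title)) & features)
    return counts, docs


def score(title, model, features):
    """Log-odds that a title is 'good', with add-one smoothing.
    Only selected tokens that appear in the title contribute."""
    counts, docs = model
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for w in set(tokenize(title)) & features:
        p_good = (counts[True][w] + 1) / (docs[True] + 2)
        p_bad = (counts[False][w] + 1) / (docs[False] + 2)
        log_odds += math.log(p_good / p_bad)
    return log_odds


# Invented training data: (title, was it a good story?)
TRAIN = [
    ("compiler internals explained", True),
    ("writing a garbage collector", True),
    ("compiler optimization notes", True),
    ("you will not believe this trick", False),
    ("top ten shocking celebrity photos", False),
    ("shocking trick doctors hate", False),
]

FEATURES = select_features(TRAIN, top_k=3)
MODEL = train(TRAIN, FEATURES)

# Rank a hypothetical "new" queue, most promising first.
new_queue = ["a new compiler for rust", "shocking trick for weight loss"]
ranked = sorted(new_queue, key=lambda t: score(t, MODEL, FEATURES), reverse=True)
print(ranked)
```

With only three selected features the model is tiny and trains in one pass over the data, which is the point: aggressive feature selection keeps the data and the training cost small.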