You're correct, the Python version is O(N) instead of O(1) in RAM usage for two ...

websiteguy · on July 28, 2012

Unix sort is a merge sort, using fixed RAM uniq -c is line by line sed is line by line

A superior solution in every way

If you were going to do this in Python (or another similar language), you need to write a sort function that operates on a file, not a Python collection, as the collection is always bound by RAM, which would at best be re-implementing the sort command. Once you can sort a file, everything else is trivial, as counts can be done line by line, or more siply, via uniq -c.

Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

pdonis · on July 29, 2012

> Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

I agree; I was not trying to claim that my Python solution should be used in preference to the shell pipeline solution in any kind of "production" environment. As I noted in another comment, Knuth's Pascal solution appears to be open to the same criticism.

pdonis · on July 28, 2012

Looking at the original paper that was linked to elsewhere in the comments here, it appears that Knuth's original Pascal program is, like my Python version, O(N) in RAM. It keeps everything in memory and uses no temporary files, as far as I can tell.