Map/Reduce in Bash (github.com/erikfrey)
69 points by coderdude on Aug 15, 2010 | hide | past | favorite | 14 comments


A cool hack, but a large part of what makes MapReduce MapReduce isn't just the "map" and the "reduce": it's all the fault-tolerance machinery needed to ensure that when you run a computable job, it damn well is going to finish, even if a meteor takes out half your data center.


That is not difficult to handle in bash. The expectation going in should be that individual computations, and even whole computers, might fail in ways that keep them from ever reporting failure; once you design for that, the fact that it is written in a language running on a VM with a poor memory manager (a VM that really wasn't designed for this kind of stress, and that rapidly dominates RAM and fragments its heap to hell) is not a problem.

(This happens to be a sore point of mine. I had a rather unenlightened professor in grad school who refused to believe I could effectively solve one massive distributed problem that came up with a "simple shell script for doing the distribution"; he gave me a bad grade in his class and never forgave me for having made the claim. This despite the fact that I sent him the finished result three days later, having narrowed the search space with some light intelligence enough to get the same answer his system took a week to produce, on better hardware, with his beloved not-a-shell-script. sigh ;P)
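To make "design for silent failure" concrete, here is a minimal sketch of what a retry wrapper might look like in shell. run_with_retry and process_chunk.sh are hypothetical names, not from the linked project, and the sketch assumes the coreutils timeout command is available:

    # Hypothetical sketch: retry a chunk of work up to 3 times,
    # treating a worker that hangs the same as one that reports failure.
    run_with_retry() {
      for attempt in 1 2 3; do
        timeout 600 "$@" && return 0   # kill workers that hang silently
        echo "attempt $attempt of '$*' failed; retrying" >&2
      done
      echo "giving up on '$*'" >&2
      return 1
    }

    run_with_retry ./process_chunk.sh part.aa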


It seems easier said than done, unless you think Hadoop's architecture and fault-tolerance logic are over-engineered.


Eh. That's a bit like saying that a web server isn't a web server unless it supports the various bells and whistles provided by Apache.

Most people don't have a gigantic cluster of computers. Their fault-tolerance needs are minimal, because the most likely point of failure is their own code. In that situation, a minimalist map/reduce implementation can be a good compromise between scale and complexity. And let's face it: for all of the internet hysteria over map/reduce, it's a very simple idea.
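Concretely, the core shape of a local map/reduce job fits in a single pipeline; a minimal sketch, where mapper.sh and reducer.sh stand in for hypothetical user-supplied scripts:

    # map, shuffle (sort by key), reduce -- all as one local pipeline
    ./mapper.sh < input.txt | sort -k1,1 | ./reducer.sh > output.txt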


Just like you would assume that the job of a web proxy is to manage files in memory or on disk, if your only experience was with Squid? Riiiiiiight.


Interesting, but considering the rich set of tools growing up around the Hadoop infrastructure (e.g., Mahout, Cascading), I think it makes much more sense to scale out horizontally using Hadoop and related technologies.


Very cool; something I'll try to utilize internally for processing large files.


I don't know if this really qualifies as "map-reduce". It's very cool, anyway :)


Why not? There appears to be both a mapper and a reducer, so...


There is both a mapper and a reducer internally, but a real map-reduce implementation should let you run custom code on both the map and the reduce, and a full-featured one should let you independently specify how many nodes you have mapping and reducing.

I may have missed it, but I don't see those things.

(I'm ignoring the fact that it is nice to be able to pass around arbitrary data structures. While true, I wouldn't expect that out of a bash implementation.)
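For what it's worth, even the fan-out is sketchable in shell. A hypothetical version with four parallel mappers and one reducer, where map.sh and reduce.sh are stand-ins for user code (note that xargs -P is a common but non-POSIX extension):

    # Hypothetical sketch: 4 parallel mappers, 1 reducer.
    split -l 10000 input.txt part.                 # chunk the input
    ls part.?? | xargs -P 4 -I{} sh -c './map.sh < {} > {}.out'
    sort -k1,1 part.??.out | ./reduce.sh > output.txt
    rm -f part.?? part.??.out                      # clean up the chunks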


map/reduce is a very simple, very general idea. It is also Google's and Hadoop's specific implementations, with all their bells and whistles, and everything in between. This implementation has more than the minimum two necessary features. If you are arguing that it isn't a true map/reduce implementation, then I would argue that you aren't a true software developer. http://en.wikipedia.org/wiki/No_true_scotsman


The idea of MapReduce was introduced to the world in the paper at http://labs.google.com/papers/mapreduce.html. The abstract starts off with: "MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper."

If you lack that feature set, then you can't solve most of the problems that MapReduce is actually used for in practice. Conversely, if you have that feature set, then you can, though not necessarily with good reliability or at optimal speed. That description is therefore what I take as the canonical definition of what it means to have implemented MapReduce.
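That model is easy to demonstrate with the paper's canonical word-count example, sketched as a pipeline (the awk reducer here is illustrative, not from the linked project):

    # map: emit one "word 1" pair per word in the input
    tr -s '[:space:]' '\n' < input.txt | sed 's/$/ 1/' |
    # shuffle: bring identical intermediate keys together
    sort -k1,1 |
    # reduce: sum the values for each intermediate key
    awk '{count[$1] += $2} END {for (w in count) print w, count[w]}'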

As for your arguing that I am not a true software developer, guilty as charged. I've been joking for years that I am merely a profitably displaced mathematician, and my current job role is a weird cross between software development and system administration.


Heh. Considering that a shell by definition must rely on other programs to do anything, a simpler implementation would be:

#!/bin/sh
/bin/hadoop start-all





