Andy and Matei's great work over the last year - if you are in the RAD Lab, you simply cannot have missed this talk! Awesome work - smart problem selection and a neat solution!
Hadoop is an open-source implementation of MapReduce. Its implicit assumption is that nodes are homogeneous, and this assumption is increasingly turning out to be wrong as environments like Amazon's EC2 become popular. Hadoop deals with "stragglers" by speculatively launching backup copies of a slow task on other machines and taking the result from whichever copy (original or backup) finishes first. The authors argue convincingly that Hadoop's default heuristic for choosing which tasks to speculate on is not the way to go.
LATE (Longest Approximate Time to End) is their solution. Instead of Hadoop's raw progress scores, it bases speculation decisions on estimated finish times: each task's progress counter and elapsed time give a progress rate, from which LATE estimates when the task will finish. It then speculatively executes the farthest-to-finish tasks on idle fast machines.
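The core heuristic is simple enough to sketch. Below is a minimal illustration in Python, not the actual Hadoop code; the function names and the task tuples are made up for this sketch, and the slow-task cutoff is a simplified stand-in for the paper's SlowTaskThreshold idea.

```python
def estimate_time_left(progress_score, elapsed_seconds):
    """Estimated seconds to completion: (1 - progress) / progress_rate."""
    if progress_score <= 0:
        return float("inf")  # no progress yet; treat as farthest from done
    rate = progress_score / elapsed_seconds  # progress per second so far
    return (1.0 - progress_score) / rate

def pick_speculation_candidate(tasks, slow_fraction=0.25):
    """tasks: list of (task_id, progress_score, elapsed_seconds).

    Only tasks whose progress rate falls in the slowest quartile are
    eligible (a simplified SlowTaskThreshold); among those, pick the
    one with the longest estimated time to end."""
    rates = sorted(p / e for _, p, e in tasks)
    cutoff = rates[int(len(rates) * slow_fraction)]
    candidates = [(tid, estimate_time_left(p, e))
                  for tid, p, e in tasks
                  if p / e <= cutoff and p < 1.0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]
```

For example, with tasks `("a", 0.9, 10)`, `("b", 0.2, 10)`, `("c", 0.8, 10)`, task `b` is both the slowest and farthest from finishing, so it would be chosen for speculation. The real scheduler also caps concurrent speculative tasks and only launches backups on fast nodes, which this sketch omits.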
This paper assumes that all tasks take the same amount of time. A "straggler" might actually be executing a heavy task that is making it slow; differentiating between that case and a genuinely slow machine could be a nice, principled direction for future work. It might lead to an intelligent assignment mechanism that allocates tasks based on how heavy they are and on the capabilities of the machines.
Also, how much of this is a problem with Hadoop as opposed to MapReduce as such? The right question to ask is: are we solving Hadoop's problems or MapReduce's problems? It looks like Hadoop has made a set of sub-optimal assumptions, and we should work more toward identifying the general problems. That said, this probably shows the complexity of handling corner cases in distributed systems...