Tuesday, October 28, 2008

Resilient Overlay Networks

RON is an architecture that lets distributed Internet applications detect and recover from path outages and degraded performance. The main goal of RON is to enable a group of nodes to communicate with each other in the face of problems in the underlying Internet paths. These nodes are connected to each other in an overlay, the RON. They aggressively measure the quality of the paths between them and exchange that information using link-state routing. Path metrics are obtained using active probing as well as passive measurements.

I liked the fact that RON gives applications control over the routing protocol. Applications can register a RON router and a RON membership manager that implement the routing protocol and the resulting decisions. This spirit is similar to the class project we are working on: ask what the application needs and make decisions about network interfaces accordingly. While this might not be ideal for every single application, the ceding of control is a fundamental shift; applications that do not want it can use a default module. That said, I am not a fan of frameworks/architectures that require applications to be re-written. Also, an interesting extension would be to offer applications the ability to monitor custom parameters, rather than only the throughput, latency etc. that are available.

Their motivating scenario, in which an ISP deploys a RON to provide better service to its customers, seemed compelling. I wonder if ISPs actually use it; unlike many other motivating scenarios for overlay networks, this one is actually motivating! :-) But I think they blame BGP a little too much - it's the mess that BGP gets into with policies that sometimes results in its sorry performance. It is not clear how RON interacts with BGP policies...

The scalability aspect (or the lack of it) raises some questions. What happens if RON actually becomes very popular and everybody deploys one? It is not clear if aggressive probing by numerous RONs would do the network any good. I suspect not, but an evaluation that points out the cases where it won't would be nice to see. Another aspect is fairness - isn't RON being unfair to non-RON users by being very aggressive?
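
To make the scalability worry concrete, here is a back-of-the-envelope sketch. It assumes every node probes every other node once per probe interval; the function names and the 10-second interval are my own illustrative choices, not numbers from the paper.

```python
def probed_paths(n):
    """Number of directed paths probed when each of n nodes probes all others."""
    return n * (n - 1)


def probes_per_second(n, interval_s):
    """Aggregate probe rate across the overlay, one probe per path per interval.

    interval_s is a hypothetical probe interval in seconds.
    """
    return probed_paths(n) / interval_s


# Doubling the overlay size roughly quadruples the probing load.
for n in (10, 50, 100):
    print(n, probed_paths(n), probes_per_second(n, 10.0))
```

The quadratic growth applies within a single RON; many independent RONs sharing the same underlying links would add their probe traffic on top of each other.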

Their result that two-hop paths are good enough to route around most outages is surprising. I would also have liked to see the worst-case scenarios identified, where the system would not work.
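
A minimal sketch of what two-hop path selection could look like, assuming each node already holds a full matrix of measured latencies from the link-state exchange. The `best_path` function and the example matrix are hypothetical, for illustration only, and this is not the paper's implementation.

```python
import math


def best_path(latency, src, dst):
    """Pick the cheaper of the direct path and any path via one intermediate node.

    latency[i][j] is the measured latency from node i to node j;
    math.inf marks a path that is currently down.
    """
    best_cost, best = latency[src][dst], [src, dst]
    for mid in range(len(latency)):
        if mid in (src, dst):
            continue
        cost = latency[src][mid] + latency[mid][dst]
        if cost < best_cost:
            best_cost, best = cost, [src, mid, dst]
    return best_cost, best


INF = math.inf
# Hypothetical three-node overlay where the direct 0 -> 1 path has failed.
latency = [
    [0, INF, 20],
    [INF, 0, 30],
    [20, 30, 0],
]
print(best_path(latency, 0, 1))  # (50, [0, 2, 1])
```

The search is linear in the number of nodes per source-destination pair, which is part of why limiting paths to one intermediate hop is attractive. The worst case is also easy to see here: if every candidate intermediate shares the failed link, the function just returns the dead direct path with infinite cost.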

Overall, I have no doubt that this paper should be retained! It was a good read and introduces many important points and concepts. I get a feeling that most of these important papers that we are reading are out of MIT and that is noteworthy ;-)
