Monday, July 11, 2005

Distributed Groove

I spent my day troubleshooting JGroups. For those unfamiliar, JGroups is an open source middleware system for group communication. The problem was that sometimes, when a client system was restarted, the client would fail to see the cluster and instead become the coordinator of its own cluster. After a bit of research, the problem turned out to be connected with long garbage collection pauses on the real coordinator. During a pause the server did not respond to heartbeat requests, which in turn led the client to conclude that it was alone and therefore make itself coordinator of its own cluster.
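To make the failure mode concrete, here is a minimal sketch of a timeout-based heartbeat check. It is not JGroups code: the class `HeartbeatMonitor`, its methods, and the 3000 ms timeout are all made up for illustration, and the time is passed in explicitly so the behaviour is easy to follow.

```java
/** Hypothetical timeout-based failure detector (illustrative only, not the JGroups implementation). */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Record that the peer answered a heartbeat at the given time. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /** The peer is suspected once it has been silent for longer than the timeout. */
    boolean isSuspected(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

With a 3000 ms timeout, a coordinator that last answered at t=2500 is still trusted at t=5000 but suspected at t=9000. The monitor only sees silence, so a garbage collection pause longer than the timeout looks exactly like a crashed coordinator.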

To solve the problem, I started looking through the JGroups source code and stumbled upon a reference to the Lamport timestamp algorithm. I remember studying the algorithm in school. The basic premise is that you can establish a logical order of messages in a distributed environment based on a concept of logical time. Lamport's paper goes into quite a bit of detail, some of it awfully trivial and some of it awfully complicated. This brought me to his website, where I discovered a stockpile of research papers.
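As a refresher, the core of Lamport timestamps fits in a few lines. This is a minimal sketch written from memory of the algorithm, not the code found in JGroups; the class name `LamportClock` and its method names are my own.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal Lamport logical clock (illustrative sketch, not taken from JGroups). */
final class LamportClock {
    private final AtomicLong time = new AtomicLong(0);

    /** Local event or message send: advance the clock and return the new timestamp. */
    long tick() {
        return time.incrementAndGet();
    }

    /** Message receive: move past the sender's timestamp, counting the receive as an event. */
    long receive(long remoteTimestamp) {
        return time.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    /** Current value of the clock, without advancing it. */
    long current() {
        return time.get();
    }
}
```

Each process keeps its own clock, stamps outgoing messages with `tick()`, and calls `receive()` with the stamp of every incoming message. That guarantees a receive always carries a larger timestamp than the matching send, which is what gives the messages a consistent logical order without synchronized wall clocks.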

He covers a lot of very interesting concepts, most of them somewhat over my head. The Byzantine systems are very interesting: the ability to write systems that can react to any type of error. He also goes on to describe a truly parallel garbage collection algorithm, something that would be quite nice in Java.

It's very interesting. Most of the code being written these days, mine included, is written to get it out quickly and cheaply. The code does what it is supposed to do, but is by no means very efficient or bulletproof. It does the work, but is at best a temporary solution. Definitely not elegant. Then I get to see all these research papers dealing explicitly with the elegance of programming. It is a nice feeling: raw computer science.
