Massive Distributed System!
843793, Nov 1 2005 (edited Nov 1 2005)
As part of my research I have created a large distributed system in Java for running calculations.
The system is composed of a single "Sender", which sends jobs to a farm of "Calculators" running on about 1000 different host machines. The Calculators send their results to a single "Receiver". Jobs can be easy (e.g. 1+1) or hard (e.g. something like log(1/[cos(ln3)])). As a result, some jobs finish in seconds while others may take as long as an hour.
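To make the data flow concrete, here is a simplified single-process sketch of the Sender, Calculator and Receiver roles. All class and method names (`Job`, `Result`, `Calculator`, `FarmSketch.runAll`) are hypothetical and chosen only for illustration. The real system runs across hosts, so this is not the actual implementation, and the round-robin hand-out is just an assumption to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.DoubleSupplier;

// Hypothetical job: an id plus the calculation to perform.
class Job {
    final int id;
    final DoubleSupplier work;
    Job(int id, DoubleSupplier work) { this.id = id; this.work = work; }
}

// Hypothetical result sent from a Calculator to the Receiver.
class Result {
    final int jobId;
    final double value;
    Result(int jobId, double value) { this.jobId = jobId; this.value = value; }
}

// Stands in for one Calculator host; here it just evaluates the job locally.
class Calculator {
    Result run(Job job) { return new Result(job.id, job.work.getAsDouble()); }
}

class FarmSketch {
    // Sender side: hand out pending jobs round-robin (an assumption).
    // Receiver side: collect results in the order they come back.
    static List<Result> runAll(List<Job> jobs, int calculatorCount) {
        Queue<Job> pending = new ArrayDeque<>(jobs);
        List<Result> received = new ArrayList<>();
        Calculator[] farm = new Calculator[calculatorCount];
        for (int i = 0; i < farm.length; i++) farm[i] = new Calculator();
        int next = 0;
        while (!pending.isEmpty()) {
            received.add(farm[next].run(pending.poll()));
            next = (next + 1) % farm.length;
        }
        return received;
    }
}
```

For example, an easy job would be `new Job(1, () -> 1 + 1)` and a harder one `new Job(2, () -> Math.log(1.0 / Math.cos(Math.log(3.0))))`.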
I have implemented this system and it works fine. But I am sure there is room for improvement and optimisation.
It needs to be:
More robust: I need good monitoring to deal with hosts crashing or dying while performing a calculation. How do I detect this? Should I use the old heartbeat method? I don't think I can risk too many messages on the network.
More scalable: what if I increase the number of hosts? Can RMI handle it, or is there an alternative?
More adaptive to the type of job, etc.
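To make the heartbeat question concrete, here is a minimal sketch of timeout-based failure detection. `HeartbeatMonitor`, `heartbeat` and `suspectedDead` are hypothetical names, not part of the existing system, and how a heartbeat actually reaches the monitor (RMI call, UDP packet, or something else) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers when each Calculator host was last heard from.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat (or any other message) arrives from a host.
    void heartbeat(String host, long nowMillis) {
        lastSeen.put(host, nowMillis);
    }

    // Returns the hosts whose last heartbeat is older than the timeout.
    // A host listed here is only suspected dead; it may just be slow.
    List<String> suspectedDead(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

As a rough idea of the network cost: if each of the 1000 hosts sent one heartbeat every 30 seconds (an assumed interval), the monitor would see about 33 small messages per second in total. Jobs assigned to a suspected host could then be re-sent to another Calculator.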
Any other ideas or links to more info are welcome.
Thanks.