Hello,
I want to implement a cluster that runs MPI executables (programs written with MPI, such as cxxpi.exe). The cluster will have a management layer, written in C#, that handles clients connecting to and disconnecting from the cluster.
I have a couple of questions regarding the implementation:
1. I want the cluster to keep working on a job even if computers are disconnected from the network. As I understand it, MPI is not fault tolerant, which means that the disconnection of a client will cause the job to be terminated. Is there a built-in solution for that?
2. What if I want to change the participating hosts? That is, if one client disconnects and another one connects, I want the new one to replace the first one in one of the jobs. Is this possible?
3. Another issue is the need to install a Windows service (the cluster will run only on Windows), as DeinoMPI and MPICH do. Do I have to install the service? All I want is for my C# management program to run on the server, with a similar exe on the client computers.
4. If I want my C# management program to communicate with the clients running the MPI jobs (to get progress information and so on), what are my options? When I run an MPI program (for example cxxpi.exe) with mpiexec, it is detached from the caller, which can no longer communicate with it...
Thank you very much for your help.
Meni.