Implement MPIEngine render engine #442
base: master
Conversation
Hi @jacklovell,

Regarding memory usage: since every MPI process builds its own copy of the scenegraph, running one rank per CPU core could exhaust the memory of a compute node. On a cluster it would make more sense to run one MPI rank per node and parallelise within the node, for example with a Slurm allocation like:

```
#SBATCH --ntasks 4
#SBATCH --cpus-per-task 48
#SBATCH --ntasks-per-node 1
#SBATCH --exclusive
```

Another point is blocking communications. In the current implementation, the root process does not render anything because it is waiting for other processes to send results. If a hybrid model (MPI + multiprocessing) were used, the root process could also take part in rendering, but it would then need non-blocking communications so it can receive results while rendering. Other MPI processes must also maintain lists of send requests, periodically checking their status and removing completed ones. I realize this is a complex solution, but so far I can't think of another one that would efficiently use the memory of the compute nodes while completely masking the data transfer with computation.
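A minimal sketch of the request bookkeeping being suggested here, assuming mpi4py; none of the names below come from the PR itself:

```python
# Illustrative sketch (not code from this PR): workers hand results to MPI with
# non-blocking isend, keep the outstanding requests (and the objects being sent)
# in a list, and periodically drop the ones that have completed.
from mpi4py import MPI

comm = MPI.COMM_WORLD

pending = []   # (request, result) pairs for sends MPI has not yet completed


def post_result(result, dest=0, tag=0):
    """Start a non-blocking send of a finished result to the root process."""
    req = comm.isend(result, dest=dest, tag=tag)
    pending.append((req, result))   # keep 'result' referenced until the send completes


def prune_completed():
    """Drop completed sends so memory is released while rendering continues."""
    still_pending = [(req, res) for req, res in pending if not req.Test()]
    pending[:] = still_pending

# A worker's render loop would call post_result() after each task, prune_completed()
# every so often, and MPI.Request.Waitall([r for r, _ in pending]) before exiting.
```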
Yeah, fair point about the memory usage. I guess I could have added that to this bit of the docstring.

One way around this would be to only load the subset of the mesh that is actually relevant to the observer. The other option is to spread the workload over more nodes and treat the problem as memory-constrained rather than CPU-core-constrained.

I'm a bit wary of adding the current MulticoreEngine implementation into this, as it'll bring with it all the same issues that render engine has: incompatibility with Windows, potential and hard-to-debug incompatibility with libraries which use threads (and with macOS > 10.13), and a reliance on the continued use of fork-without-exec in multiprocessing. Python 3.14 is already switching away from fork-without-exec as the default and it's heavily discouraged now, so I didn't want to include it in a new implementation (I wouldn't be surprised if they deprecated and eventually removed it).

I can see the advantage of an MPI + multithreading hybrid model, but MPI + multiprocessing to me seems like it would have too large a surface area for hard-to-debug problems. Bear in mind the threading incompatibility requires intervention for each individual application; it's not something a library like Raysect can guard against. Once a MultithreadEngine, using multithreading rather than multiprocessing, is available, it would be worth incorporating into a hybrid MPI engine.
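For reference, a small illustrative snippet (standard library only, unrelated to Raysect's API) of the start methods being discussed; the fork-without-exec default is what Python 3.14 moves away from on Linux:

```python
# Illustrative only: inspect the platform's multiprocessing start method and request
# "spawn" explicitly so nothing depends on fork-without-exec semantics.
import multiprocessing as mp


def work(x):
    return x * x


if __name__ == "__main__":
    print("platform default start method:", mp.get_start_method())

    # An explicit context avoids relying on the (changing) platform default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(work, range(4)))
```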
This render engine uses a combination of distributed and shared memory tools to perform parallel rendering. MPI (message passing interface) is used to communicate between processes with separate memory, for example on different nodes of a compute cluster. Within each MPI process a secondary render engine is used for shared-memory computation. The secondary (or sub) render engine may be any other render engine: `SerialEngine` and `MulticoreEngine` are currently supported, though the former is less efficient than using the MPIEngine due to additional IPC overhead. The engine + sub-engine architecture leaves room to expand to additional engines in future, such as a `MultiThreadEngine` in a free-threaded (no-GIL) Python build.

Gathering of render results is done in the main process on MPI rank 0. A sub-process is spawned on rank 0 in which the sub render engine runs. For this to work efficiently the operating system must support cheap process forking without memcopy. At present only Linux is known to support this properly; macOS may work if you're lucky. This limitation means the render engine is not truly cross-platform: if cross-platform compatibility is required then SerialEngine and MPIEngine are still the only suitable choices. Ideally the result gathering and updates should not require spawning a new process: a separate thread would make more logical sense, but doing it in the same thread as the sub-engine causes resource contention and therefore significant performance degradation. Suggestions for fixing this would be most helpful.

Hybrid MPI + multiprocessing has many gotchas, and care must be taken when launching the program, especially with a job scheduler. The documentation includes some hints about how the program should be run with a few popular schedulers, and there is also a static helper method for getting an appropriate number of processes for a parallel sub-worker. Ultimately, though, it is the responsibility of the end user to configure the engine and start the job properly: the render engine itself does very little of this automatically.

Two new demos have been added showcasing this new render engine: one rendering the Raysect logo and one doing a high-fidelity, multi-pass adaptive sampling render of the Cornell Box.
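As a rough, hypothetical sketch of the "appropriate number of processes" idea above: the helper name and the hybrid engine's class name below are assumptions for illustration, not the PR's actual API.

```python
# Hypothetical sketch only. One way a rank could pick a sub-engine process count from
# the scheduler's allocation (e.g. Slurm's --cpus-per-task), with a sensible fallback.
import os


def suggested_sub_processes(default=1):
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    if slurm_cpus:
        return int(slurm_cpus)
    if hasattr(os, "sched_getaffinity"):          # CPUs actually available to this process
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or default

# Usage might then look something like this (class name assumed, not from the PR):
#   engine = MPIHybridEngine(MulticoreEngine(processes=suggested_sub_processes()))
#   camera.render_engine = engine
```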
I've made an additional render engine, described above. I propose keeping the simpler MPIEngine as well.
This render engine uses SPMD (single process, multiple data) to perform renders in parallel, using MPI (message passing interface). The paradigm is that multiple independent processes are launched, all of which build their own copy of the scenegraph in parallel and all generate rendering tasks. Each process with rank > 0 then works on a subset of those tasks and sends the results to the rank = 0 process, which processes the results. After the render is complete, only process 0 has the complete results.
MPI communications are blocking (send/recv rather than isend/irecv) and use mpi4py's methods for general Python objects, rather than the high-performance equivalents utilising the buffer protocol (Send/Recv). This is done to avoid loss of generality of the render engine: it supports any function supported by the other render engines, not just ones returning objects supporting the buffer protocol. The use of blocking comms simplifies the implementation, and in testing (using the MPI variant of the raysect logo demo) the MPI comms were a small fraction (< 5%) of the total run time, with the vast majority of the time (> 85%) spent in the render function itself. So the simplicity of implementation does not have a significant performance cost.
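To make the send/recv versus Send/Recv distinction concrete, a minimal sketch assuming mpi4py and at least two ranks (e.g. `mpiexec -n 2 python demo.py`):

```python
# Minimal sketch: the lowercase API pickles arbitrary Python objects, whereas the
# uppercase buffer API is restricted to buffer-protocol data such as NumPy arrays.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 1:
    # Any picklable result (here a dict) can be sent with the general-object API.
    comm.send({"task_id": 7, "samples": [0.1, 0.2, 0.3]}, dest=0, tag=0)
elif rank == 0 and comm.Get_size() > 1:
    result = comm.recv(source=1, tag=0)   # blocking receive, returns the unpickled object
    print(result)

# The faster buffer-protocol form, comm.Send([array, MPI.DOUBLE], dest=0), only accepts
# buffer-like objects, which is why the engine keeps the general (pickle) form.
```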
Tasks are distributed approximately equally among all workers at startup, with no adaptive scheduling. Again, this is done for simplicity of implementation as an adaptive scheduler would need some way of tracking how long each process was spending on tasks and all processes would need to communicate with one another about which tasks they were taking on. Adding an adaptive scheduler could be left to future work, but in testing any uneven runtime caused by the naive task distribution seemed to only have a small effect on the total run time, so the simplicity of the current solution is advantageous.
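A simplified sketch of this startup task split, assuming mpi4py; the interleaved slicing below is illustrative, not necessarily the exact partition the engine uses:

```python
# Every rank builds the same task list; ranks 1..size-1 each take a fixed, roughly equal
# slice; rank 0 only gathers. Launch with e.g. mpiexec -n 4 python render_sketch.py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()


def render_task(task):
    return task * task              # placeholder for the real per-task render work


tasks = list(range(1000))           # identical on every rank

if rank == 0 and size > 1:
    # One result per task, from whichever worker produced it.
    results = [comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG) for _ in tasks]
    print(len(results), "results gathered")
elif rank > 0:
    for task in tasks[rank - 1 :: size - 1]:
        comm.send(render_task(task), dest=0, tag=0)
```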
One other advantage of this render engine is that it doesn't require the fork() semantics of Linux for efficient sharing of the scenegraph between processes. It'll therefore enable efficient parallel processing on systems which don't implement fork(), such as Windows. This is partly why I named it MPIEngine rather than ClusterEngine, as it's useful on more than just distributed memory systems.
This is branched off from master (v0.8.1) and doesn't have any of the refactor in the feature/ClusterEngine branch, as that was the quickest way to get something prototyped. I'm opening the PR for early feedback before worrying too much about bringing ClusterEngine (and development) up to date with the latest release and then adding this on top.