Implement MPIEngine render engine #442

Open · wants to merge 2 commits into base: master
Conversation

jacklovell
Contributor

This render engine uses SPMD (single program, multiple data) to perform renders in parallel, using MPI (Message Passing Interface). The paradigm is that multiple independent processes are launched, all of which build their own copy of the scenegraph in parallel and all generate rendering tasks. Each process with rank > 0 then works on a subset of those tasks and sends the results to the rank = 0 process, which consumes them. After the render is complete, only process 0 has the complete results.

MPI communications are blocking (send/recv rather than isend/irecv) and use mpi4py's methods for general Python objects rather than the high-performance equivalents that use the buffer protocol (Send/Recv). The general-object methods avoid losing generality in the render engine: it supports any function supported by the other render engines, not just ones returning objects that support the buffer protocol. Blocking comms simplify the implementation, and in testing (using the MPI variant of the raysect logo demo) the MPI comms were a small fraction (< 5%) of the total run time, with the vast majority (> 85%) spent in the render function itself. So the simplicity of implementation does not have a significant performance cost.
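The flow described above can be sketched with mpi4py-style calls. All names here (`run_spmd`, `render`, `update`) are illustrative, not Raysect's actual API, and the communicator is passed in as a parameter so the logic can be exercised without an MPI launch:

```python
def run_spmd(comm, render, update, tasks):
    """SPMD render sketch: rank 0 gathers results, ranks > 0 render.

    `comm` is an mpi4py-style communicator (e.g. MPI.COMM_WORLD).
    Illustrative only; the PR's actual implementation may differ.
    """
    rank = comm.Get_rank()
    n_workers = comm.Get_size() - 1
    if rank == 0:
        # Blocking, pickle-based receives; with mpi4py's comm.recv() the
        # default source is MPI.ANY_SOURCE, so results arrive in whatever
        # order the workers finish.
        for _ in range(len(tasks)):
            task_id, result = comm.recv()
            update(task_id, result)
    else:
        # Each worker takes a static round-robin share of the task list.
        for task_id in range(rank - 1, len(tasks), n_workers):
            comm.send((task_id, render(tasks[task_id])), dest=0)
```

Under a real launch (e.g. `mpirun -np N python script.py`) every rank would call this with `MPI.COMM_WORLD`; only rank 0's `update` ever sees results.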

Tasks are distributed approximately equally among all workers at startup, with no adaptive scheduling. Again, this is done for simplicity of implementation: an adaptive scheduler would need some way of tracking how long each process was spending on tasks, and all processes would need to communicate with one another about which tasks they were taking on. Adding an adaptive scheduler could be left to future work; in testing, any uneven runtime caused by the naive task distribution seemed to have only a small effect on the total run time, so the simplicity of the current solution is advantageous.
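To make "approximately equally" concrete, here is one possible static scheme (a round-robin split — illustrative only, the PR's actual partitioning may differ), where worker shares are guaranteed to differ in size by at most one task:

```python
def worker_share(n_tasks, rank, n_workers):
    """Task indices for worker `rank` (1-based) under a round-robin split.

    Hypothetical helper for illustration, not part of the PR's API.
    """
    return list(range(rank - 1, n_tasks, n_workers))

# Three workers splitting 10 tasks: every task is covered exactly once,
# and no worker's share is more than one task larger than another's.
shares = [worker_share(10, r, 3) for r in (1, 2, 3)]
```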

One other advantage of this render engine is that it doesn't require the fork() semantics of Linux for efficient sharing of the scenegraph between processes. It'll therefore enable efficient parallel processing on systems which don't implement fork(), such as Windows. This is partly why I named it MPIEngine rather than ClusterEngine, as it's useful on more than just distributed memory systems.

This is branched off from master (v0.8.1) and doesn't have any of the refactor in the feature/ClusterEngine branch, as that was the quickest way to get something prototyped. I'm opening the PR for early feedback before worrying too much about bringing ClusterEngine (and development) up to date with the latest release and then adding this on top.

@vsnever
Contributor

vsnever commented Mar 14, 2025

Hi @jacklovell,
I also think that MPIEngine is a must in Raysect. However, I think that a memory-efficient MPIEngine should use both MPI and multiprocessing. Since each MPI process is independent, it creates its own copy of the scenegraph even when it runs on the same computing node as other processes. On a typical HPC computing node with 128 GB of RAM, 48 CPU cores and one MPI process per CPU core, this limits the size of a scenegraph to about 2.66 GB. Detailed CAD models take up several times more RAM than that.

I think that MPIEngine should use a hybrid scheme: one MPI process per computing node, which spawns multiple Python processes using multiprocessing with fork semantics. We cannot share the scenegraph between compute nodes, but at least within one compute node we can use shared memory. Fortunately, SLURM and other resource managers allow defining such a job configuration. For example, to run a Raysect job on 4 computing nodes with 48 CPUs each we should specify:

#SBATCH --ntasks 4
#SBATCH --cpus-per-task 48
#SBATCH --ntasks-per-node 1
#SBATCH --exclusive
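The hybrid scheme proposed here can be sketched as follows. All names are hypothetical (this is not Raysect's API), and the communicator is a parameter so the node-local part can be exercised without an MPI launch; each MPI rank fans its share of tasks out to a fork()-based multiprocessing pool, so the scenegraph is shared copy-on-write within the node:

```python
import multiprocessing as mp

def hybrid_render(comm, render, tasks):
    """One MPI rank per node; multiprocessing within each node.

    `comm` is an mpi4py-style communicator (e.g. MPI.COMM_WORLD).
    Illustrative sketch of the proposed scheme, not a real implementation.
    """
    rank, size = comm.Get_rank(), comm.Get_size()
    my_tasks = tasks[rank::size]  # static split across nodes
    # fork start method: children share the parent's scenegraph pages
    # copy-on-write instead of each holding a full copy.
    with mp.get_context("fork").Pool() as pool:
        local = pool.map(render, my_tasks)
    # Only the root rank receives the full set of partial results.
    return comm.gather(local, root=0)
```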

Another point is blocking communications. In the current implementation, the root process does not render anything because it is waiting for other processes to send results. If a hybrid model (MPI + multiprocessing) is implemented, this would block the entire computing node from rendering. With non-blocking communications the root process can perform rendering just like any other process. In this case, the root process:

  1. creates a list of non-blocking receive requests,
  2. starts rendering,
  3. tests the status of these requests before each update,
  4. gets the results from the completed ones,
  5. runs the update with these results,
  6. removes the completed receive requests from the list and adds new ones.

Other MPI processes must also maintain lists of send requests, periodically checking their status and removing completed ones.
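The root-side loop in the steps above might look like this (hypothetical names, simplified so that a `None` message signals a finished worker; the send side would maintain an analogous list of send requests). The communicator is injected so the bookkeeping can be tested without MPI; with mpi4py the request objects come from `comm.irecv()` and are polled with `Request.test()`:

```python
def gather_nonblocking(comm, workers, update, render_step):
    """Root-rank gather loop using non-blocking receives (sketch only)."""
    # 1. One outstanding non-blocking receive request per worker.
    requests = {w: comm.irecv(source=w) for w in workers}
    while requests:
        render_step()                       # 2. The root renders as well.
        for src, req in list(requests.items()):
            done, msg = req.test()          # 3. Poll before each update.
            if not done:
                continue
            if msg is None:                 # Worker signalled completion:
                del requests[src]           # 6a. drop its request.
            else:
                update(msg)                 # 4-5. Fold result into frame.
                requests[src] = comm.irecv(source=src)  # 6b. Re-post.
```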

I realize this is a complex solution, but so far I can't think of another one that would efficiently use the memory of the compute nodes while completely masking the data transfer with computation.

@jacklovell
Contributor Author

Yeah, fair point about the memory usage. I guess I could have added that to this bit of the docstring:

This engine is useful for distributed memory systems, and for shared
memory systems where the overhead of inter-process communication of
the scenegraph is large compared with the time taken for a single
process to produce the scenegraph (e.g. on Windows).

One way around this would be to only load the subset of the mesh that is actually relevant to the observer. The other option is to spread the workload over more nodes and treat the problem as memory-constrained rather than CPU-core-constrained.

I'm a bit wary of adding the current MulticoreEngine implementation into this, as it'll bring with it all the same issues that render engine has: incompatibility with Windows; potential, hard-to-debug incompatibilities with libraries which use threads (and with macOS > 10.13); and a reliance on the continued use of fork-without-exec in multiprocessing. Python 3.14 is already switching away from fork-without-exec as the default start method and it's heavily discouraged now, so I didn't want to include it in a new implementation (I wouldn't be surprised if fork were deprecated and eventually removed as a multiprocessing start method altogether).

I can see the advantage of an MPI+multithreading hybrid model, but MPI+multiprocessing to me seems like it would have too large a surface area for hard-to-debug problems. Bear in mind the threading incompatibility requires intervention for each individual application: it's not something a library like Raysect can guard against. Once a MultithreadEngine, using multithreading rather than multiprocessing, is available, this would be worth incorporating into a hybrid engine:

This render engine uses a combination of distributed and shared memory tools to perform parallel rendering. MPI (message passing interface) is used to communicate between processes with separate memory, for example on different nodes of a compute cluster. Within each MPI process a secondary render engine is used for shared memory computation.

The secondary (or sub) render engine may be any other render
engine. `SerialEngine` and `MulticoreEngine` are currently supported,
though the former is less efficient than using the MPIEngine due to
additional IPC overhead. The engine+subengine architecture leaves room
to expand to additional engines in future, such as a
`MultiThreadEngine` in a free-threaded (no GIL) Python build.
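The engine+subengine delegation described above can be sketched with toy classes (these are illustrative stand-ins, not Raysect's actual classes or constructor signatures): the hybrid engine takes a static share of the tasks for its MPI rank and hands that share to whatever node-local engine it wraps.

```python
class SerialEngine:
    """Toy stand-in for a node-local sub render engine."""
    def run(self, tasks, render):
        return [render(t) for t in tasks]

class HybridEngine:
    """Toy sketch: split tasks across MPI ranks, delegate each rank's
    share to a node-local subengine. Rank/size would normally come from
    MPI.COMM_WORLD rather than constructor arguments."""
    def __init__(self, subengine=None, rank=0, size=1):
        self.subengine = subengine or SerialEngine()
        self.rank, self.size = rank, size
    def run(self, tasks, render):
        share = tasks[self.rank::self.size]  # this rank's static share
        return self.subengine.run(share, render)
```

The point of the architecture is that `run` never cares what the subengine is, so swapping in a `MulticoreEngine` (or a future `MultiThreadEngine`) requires no changes to the MPI layer.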

Gathering of render results is done in the main process on MPI rank
0. A sub process is spawned on rank 0 in which the sub render engine
runs. For this to work efficiently the operating system must support
cheap process forking without memcopy. At present only Linux is known
to support this properly. MacOS may work if you're lucky. This
limitation means the render engine is not truly cross platform: if
cross-platform compatibility is required then SerialEngine and
MPIEngine are still the only suitable choices.

Ideally the result gathering and updates should not require spawning a
new process: a separate thread would make more logical sense. But
doing it in the same thread as the sub-engine is run causes resource
contention and therefore significant performance degradation.
Suggestions for fixing this would be most helpful.

Hybrid MPI+multiprocessing has many gotchas, and care must be taken
when launching the program especially with a job scheduler. The
documentation includes some hints about how the program should be run
with a few popular schedulers, and there is also a static helper
method for getting an appropriate number of processes for a parallel
sub worker. But ultimately it's the responsibility of the end user to
configure the engine and start the job properly: the render engine
itself does very little of this automatically.
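For illustration, a SLURM launch along the lines discussed earlier (one MPI rank per node, leaving that node's cores to the sub-engine) might look like the following config fragment. The exact flags depend on the cluster and MPI distribution, and `my_render_script.py` is a placeholder name:

```
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
srun python my_render_script.py
```

With Open MPI outside a scheduler, `mpirun -np 4 --map-by ppr:1:node python my_render_script.py` expresses the same one-rank-per-node placement.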

Two new demos have been added showcasing this new render engine: one rendering the Raysect logo and one performing a high-fidelity, multi-pass adaptive-sampling render of the Cornell box.
@jacklovell
Contributor Author

I've made an additional render engine, HybridEngine, in the latest commit. This combines MPI with another node-local render engine, so you can use distributed memory across multiple cluster nodes via MPI together with shared memory on each node via multiprocessing. It requires more care on the part of the end user to set up and run their application properly, but it solves the memory efficiency problem on systems where fork-without-exec works properly.

I propose keeping the simpler MPIEngine too, as this is the only cross-platform way to run parallel renders. And for scenes without a large memory footprint the simplicity of using it makes it attractive compared with the hybrid engine. But now end users have the choice of either.
