Implement MPIEngine render engine #442

Open · wants to merge 2 commits into base: master
Conversation

jacklovell
Contributor

This render engine uses SPMD (single program, multiple data) to perform renders in parallel, using MPI (Message Passing Interface). The paradigm is that multiple independent processes are launched, all of which build their own copy of the scenegraph in parallel and all generate rendering tasks. Each process with rank > 0 then works on a subset of those tasks and sends the results to the rank = 0 process, which consumes them. After the render is complete, only process 0 has the complete results.

MPI communications are blocking (send/recv rather than isend/irecv) and use mpi4py's methods for general Python objects rather than the high-performance equivalents that use the buffer protocol (Send/Recv). The general-object methods avoid losing generality in the render engine: it supports any function supported by the other render engines, not just ones returning objects that support the buffer protocol. Blocking comms simplify the implementation, and in testing (using the MPI variant of the raysect logo demo) the MPI comms were a small fraction (< 5%) of the total run time, with the vast majority (> 85%) spent in the render function itself. So the simplicity of implementation does not have a significant performance cost.
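The flow described above can be sketched with mpi4py-style calls. All names here (`run_spmd`, `render`, `update`) are illustrative, not Raysect's actual API, and the communicator is passed in as a parameter so the logic can be exercised without an MPI launch:

```python
def run_spmd(comm, render, update, tasks):
    """SPMD render sketch: rank 0 gathers results, ranks > 0 render.

    `comm` is an mpi4py-style communicator (e.g. MPI.COMM_WORLD).
    Illustrative only; the PR's actual implementation may differ.
    """
    rank = comm.Get_rank()
    n_workers = comm.Get_size() - 1
    if rank == 0:
        # Blocking, pickle-based receives; with mpi4py's comm.recv() the
        # default source is MPI.ANY_SOURCE, so results arrive in whatever
        # order the workers finish.
        for _ in range(len(tasks)):
            task_id, result = comm.recv()
            update(task_id, result)
    else:
        # Each worker takes a static round-robin share of the task list.
        for task_id in range(rank - 1, len(tasks), n_workers):
            comm.send((task_id, render(tasks[task_id])), dest=0)
```

Under a real launch (e.g. `mpirun -np N python script.py`) every rank would call this with `MPI.COMM_WORLD`; only rank 0's `update` ever sees results.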

Tasks are distributed approximately equally among all workers at startup, with no adaptive scheduling. Again, this is done for simplicity of implementation: an adaptive scheduler would need some way of tracking how long each process was spending on tasks, and all processes would need to communicate with one another about which tasks they were taking on. Adding an adaptive scheduler could be left to future work; in testing, any uneven runtime caused by the naive task distribution seemed to have only a small effect on the total run time, so the simplicity of the current solution is advantageous.
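To make "approximately equally" concrete, here is one possible static scheme (a round-robin split — illustrative only, the PR's actual partitioning may differ), where worker shares are guaranteed to differ in size by at most one task:

```python
def worker_share(n_tasks, rank, n_workers):
    """Task indices for worker `rank` (1-based) under a round-robin split.

    Hypothetical helper for illustration, not part of the PR's API.
    """
    return list(range(rank - 1, n_tasks, n_workers))

# Three workers splitting 10 tasks: every task is covered exactly once,
# and no worker's share is more than one task larger than another's.
shares = [worker_share(10, r, 3) for r in (1, 2, 3)]
```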

One other advantage of this render engine is that it doesn't require the fork() semantics of Linux for efficient sharing of the scenegraph between processes. It'll therefore enable efficient parallel processing on systems which don't implement fork(), such as Windows. This is partly why I named it MPIEngine rather than ClusterEngine, as it's useful on more than just distributed memory systems.

This is branched off from master (v0.8.1) and doesn't have any of the refactor in the feature/ClusterEngine branch, as that was the quickest way to get something prototyped. I'm opening the PR for early feedback before worrying too much about bringing ClusterEngine (and development) up to date with the latest release and then adding this on top.

@vsnever
Contributor

vsnever commented Mar 14, 2025

Hi @jacklovell,
I also think that MPIEngine is a must in Raysect. However, I think that a memory-efficient MPIEngine should use both MPI and multiprocessing. Since each MPI process is independent, it creates its own copy of the scenegraph even when it runs on the same computing node as other processes. On a typical HPC computing node with 128 GB of RAM, 48 CPU cores and one MPI process per CPU core, this limits the size of a scenegraph to about 2.66 GB. Detailed CAD models take up several times more RAM than that.

I think that MPIEngine should use a hybrid scheme: one MPI process per computing node, which spawns multiple Python processes using multiprocessing with fork semantics. We cannot share the scenegraph between compute nodes, but at least within one compute node we can use shared memory. Fortunately, SLURM and other resource managers allow defining such a job configuration. For example, to run a Raysect job on 4 computing nodes with 48 CPUs each we should specify:

#SBATCH --ntasks 4
#SBATCH --cpus-per-task 48
#SBATCH --ntasks-per-node 1
#SBATCH --exclusive
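The hybrid scheme proposed here can be sketched as follows. All names are hypothetical (this is not Raysect's API), and the communicator is a parameter so the node-local part can be exercised without an MPI launch; each MPI rank fans its share of tasks out to a fork()-based multiprocessing pool, so the scenegraph is shared copy-on-write within the node:

```python
import multiprocessing as mp

def hybrid_render(comm, render, tasks):
    """One MPI rank per node; multiprocessing within each node.

    `comm` is an mpi4py-style communicator (e.g. MPI.COMM_WORLD).
    Illustrative sketch of the proposed scheme, not a real implementation.
    """
    rank, size = comm.Get_rank(), comm.Get_size()
    my_tasks = tasks[rank::size]  # static split across nodes
    # fork start method: children share the parent's scenegraph pages
    # copy-on-write instead of each holding a full copy.
    with mp.get_context("fork").Pool() as pool:
        local = pool.map(render, my_tasks)
    # Only the root rank receives the full set of partial results.
    return comm.gather(local, root=0)
```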

Another point is blocking communications. In the current implementation, the root process does not render anything because it is waiting for other processes to send results. If a hybrid model (MPI + multiprocessing) is implemented, this would block the entire computing node from rendering. With non-blocking communications the root process can perform rendering just like any other process. In this case, the root process:

  1. creates a list of non-blocking receive requests,
  2. starts rendering,
  3. tests the status of these requests before each update,
  4. gets the results from the completed ones,
  5. runs the update with these results,
  6. removes the completed receive requests from the list and adds new ones.

Other MPI processes must also maintain lists of send requests, periodically checking their status and removing completed ones.
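The root-side loop in the steps above might look like this (hypothetical names, simplified so that a `None` message signals a finished worker; the send side would maintain an analogous list of send requests). The communicator is injected so the bookkeeping can be tested without MPI; with mpi4py the request objects come from `comm.irecv()` and are polled with `Request.test()`:

```python
def gather_nonblocking(comm, workers, update, render_step):
    """Root-rank gather loop using non-blocking receives (sketch only)."""
    # 1. One outstanding non-blocking receive request per worker.
    requests = {w: comm.irecv(source=w) for w in workers}
    while requests:
        render_step()                       # 2. The root renders as well.
        for src, req in list(requests.items()):
            done, msg = req.test()          # 3. Poll before each update.
            if not done:
                continue
            if msg is None:                 # Worker signalled completion:
                del requests[src]           # 6a. drop its request.
            else:
                update(msg)                 # 4-5. Fold result into frame.
                requests[src] = comm.irecv(source=src)  # 6b. Re-post.
```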

I realize this is a complex solution, but so far I can't think of another one that would efficiently use the memory of the compute nodes while completely masking the data transfer with computation.

@jacklovell
Contributor Author

Yeah, fair point about the memory usage. I guess I could have added that to this bit of the docstring:

This engine is useful for distributed memory systems, and for shared
memory systems where the overhead of inter-process communication of
the scenegraph is large compared with the time taken for a single
process to produce the scenegraph (e.g. on Windows).

One way around this would be to only load the subset of the mesh that is actually relevant to the observer. The other option is to spread the workload over more nodes and treat the problem as memory-constrained rather than CPU-core-constrained.

I'm a bit wary of adding the current MulticoreEngine implementation into this, as it'll bring with it all the same issues that render engine has: incompatibility with Windows; potential, hard-to-debug incompatibilities with libraries which use threads (and with macOS > 10.13); and a reliance on the continued use of fork-without-exec in multiprocessing. Python 3.14 is already switching away from fork-without-exec as the default start method and it's heavily discouraged now, so I didn't want to include it in a new implementation (I wouldn't be surprised if fork were deprecated and eventually removed as a multiprocessing start method altogether).

I can see the advantage of an MPI+multithreading hybrid model, but MPI+multiprocessing to me seems like it would have too large a surface area for hard-to-debug problems. Bear in mind the threading incompatibility requires intervention for each individual application: it's not something a library like Raysect can guard against. Once a MultithreadEngine, using multithreading rather than multiprocessing, is available, this would be worth incorporating into a hybrid engine:

This render engine uses a combination of distributed and shared memory tools to perform parallel rendering. MPI (message passing interface) is used to communicate between processes with separate memory, for example on different nodes of a compute cluster. Within each MPI process a secondary render engine is used for shared memory computation.

The secondary (or sub) render engine may be any other render
engine. `SerialEngine` and `MulticoreEngine` are currently supported,
though the former is less efficient than using the MPIEngine due to
additional IPC overhead. The engine+subengine architecture leaves room
to expand to additional engines in future, such as a
`MultiThreadEngine` in a free-threaded (no GIL) Python build.
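The engine+subengine delegation described above can be sketched with toy classes (these are illustrative stand-ins, not Raysect's actual classes or constructor signatures): the hybrid engine takes a static share of the tasks for its MPI rank and hands that share to whatever node-local engine it wraps.

```python
class SerialEngine:
    """Toy stand-in for a node-local sub render engine."""
    def run(self, tasks, render):
        return [render(t) for t in tasks]

class HybridEngine:
    """Toy sketch: split tasks across MPI ranks, delegate each rank's
    share to a node-local subengine. Rank/size would normally come from
    MPI.COMM_WORLD rather than constructor arguments."""
    def __init__(self, subengine=None, rank=0, size=1):
        self.subengine = subengine or SerialEngine()
        self.rank, self.size = rank, size
    def run(self, tasks, render):
        share = tasks[self.rank::self.size]  # this rank's static share
        return self.subengine.run(share, render)
```

The point of the architecture is that `run` never cares what the subengine is, so swapping in a `MulticoreEngine` (or a future `MultiThreadEngine`) requires no changes to the MPI layer.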

Gathering of render results is done in the main process on MPI rank
0. A sub process is spawned on rank 0 in which the sub render engine
runs. For this to work efficiently the operating system must support
cheap process forking without memcopy. At present only Linux is known
to support this properly. MacOS may work if you're lucky. This
limitation means the render engine is not truly cross platform: if
cross-platform compatibility is required then SerialEngine and
MPIEngine are still the only suitable choices.

Ideally the result gathering and updates should not require spawning a
new process: a separate thread would make more logical sense. But
doing it in the same thread as the sub-engine is run causes resource
contention and therefore significant performance degradation.
Suggestions for fixing this would be most helpful.

Hybrid MPI+multiprocessing has many gotchas, and care must be taken
when launching the program especially with a job scheduler. The
documentation includes some hints about how the program should be run
with a few popular schedulers, and there is also a static helper
method for getting an appropriate number of processes for a parallel
sub worker. But ultimately it's the responsibility of the end user to
configure the engine and start the job properly: the render engine
itself does very little of this automatically.
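For illustration, a SLURM launch along the lines discussed earlier (one MPI rank per node, leaving that node's cores to the sub-engine) might look like the following config fragment. The exact flags depend on the cluster and MPI distribution, and `my_render_script.py` is a placeholder name:

```
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 48
srun python my_render_script.py
```

With Open MPI outside a scheduler, `mpirun -np 4 --map-by ppr:1:node python my_render_script.py` expresses the same one-rank-per-node placement.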

Two new demos have been added showcasing this new render engine: one rendering the Raysect logo and one performing a high-fidelity, multi-pass adaptive-sampling render of the Cornell box.
@jacklovell
Contributor Author

I've made an additional render engine, HybridEngine, in the latest commit. This combines MPI with another node-local render engine, so you can use distributed memory across multiple cluster nodes via MPI together with shared memory on each node via multiprocessing. It requires more care on the part of the end user to set up and run their application properly, but it solves the memory efficiency problem on systems where fork-without-exec works properly.

I propose keeping the simpler MPIEngine too, as this is the only cross-platform way to run parallel renders. And for scenes without a large memory footprint the simplicity of using it makes it attractive compared with the hybrid engine. But now end users have the choice of either.
