Parallel post transform including write_doc_serialized #11746

Open · wants to merge 6 commits into base: master

1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -63,6 +63,7 @@ Contributors
* Lars Hupfeldt Nielsen - OpenSSL FIPS mode md5 bug fix
* Łukasz Langa -- partial support for autodoc
* Marco Buttu -- doctest extension (pyversion option)
* Marco Heinemann -- multiprocessing improvements
* Martin Hans -- autodoc improvements
* Martin Larralde -- additional napoleon admonitions
* Martin Mahner -- nature theme
5 changes: 5 additions & 0 deletions CHANGES.rst
@@ -30,6 +30,11 @@ Features added
* #11981: Improve rendering of signatures using ``slice`` syntax,
e.g., ``def foo(arg: np.float64[:,:]) -> None: ...``.

* #10779 and #11448: Parallel execution of post-transformation and
write_doc_serialized as an experimental feature.
Speeds up builds featuring expensive post-transforms by a factor of at least 2.
Patch by Marco Heinemann.

Bugs fixed
----------

2 changes: 2 additions & 0 deletions doc/extdev/builderapi.rst
@@ -21,6 +21,7 @@ Builder API
.. autoattribute:: supported_remote_images
.. autoattribute:: supported_data_uri_images
.. autoattribute:: default_translator_class
.. autoattribute:: post_transform_merge_attr

These methods are predefined and will be called from the application:

@@ -37,6 +38,7 @@ Builder API
.. automethod:: get_target_uri
.. automethod:: prepare_writing
.. automethod:: write_doc
.. automethod:: merge_builder_post_transform
.. automethod:: finish

**Attributes**
72 changes: 72 additions & 0 deletions doc/usage/configuration.rst
@@ -788,6 +788,78 @@ General configuration

.. versionadded:: 5.1

.. confval:: enable_parallel_post_transform

   Default is ``False``.
   This experimental feature flag activates parallel post-transformation during
   the parallel write phase. When active, the :event:`doctree-resolved` event and
   the builder method ``write_doc_serialized()`` also run in parallel.
   Parallel post-transformation can greatly reduce build time for extensions that
   do heavy computation in that phase. Depending on machine core count and project
   size, build time reductions by a factor of 2 to 4, and sometimes more, have
   been observed. The flag has no effect if parallel writing is not used.
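
   For example, a minimal ``conf.py`` sketch; the flag only takes effect
   together with parallel writing (e.g. ``sphinx-build -j auto``):

   .. code-block:: python

      # conf.py
      # Opt in to the experimental parallel post-transform phase;
      # only effective when Sphinx also writes in parallel (-j N).
      enable_parallel_post_transform = True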

*Background*

   By default, if parallel writing is active (that is, no extension inhibits it
   via its :ref:`metadata <ext-metadata>`), the following logic applies:

   .. code-block:: text

      For each chunk of docnames:
        main process: post-transform including doctree-resolved, encapsulated by
                      BuildEnvironment.get_and_resolve_doctree()
        main process: Builder.write_doc_serialized()
        sub process: Builder.write_doc()

   This means only the ``write_doc()`` function is executed in parallel; each
   subprocess waits for the main process to prepare its chunk. This is a
   serious bottleneck that practically inhibits parallel execution when
   extensions perform CPU-intensive calculations during post-transformation.

Activating this feature flag changes the logic as follows:

   .. code-block:: text

      For each chunk of docnames:
        sub process: post-transform including doctree-resolved, encapsulated by
                     BuildEnvironment.get_and_resolve_doctree()
        sub process: Builder.write_doc_serialized()
        sub process: Builder.write_doc()
        sub process: pickle and return certain Builder attributes
        main process: merge attributes back to main process builder

   This effectively resolves the main-process bottleneck, as post-transformations
   now run in parallel. The expected per-doctree order of
   ``post-transform > write_doc_serialized > write_doc`` stays intact. The
   approach can, however, lead to issues if extensions write to the environment
   or the builder during the post-transformation phase or in
   ``write_doc_serialized()`` and expect that information to be available after
   the subprocess has ended: each subprocess has a completely separate memory
   space, and its contents are lost when the process ends. For both built-in and
   custom builders, specific attributes can be returned to the main process;
   see the note below for details.

   .. note::
      Be sure all active extensions support parallel post-transformation before
      using this flag.

      Extensions writing to :py:class:`sphinx.environment.BuildEnvironment` and
      expecting the data to be available at a later build stage
      (e.g. in :event:`build-finished`) are *not* supported.
      For the builder object, a mechanism exists to return data to the main
      process: the builder class may set the attribute
      :py:attr:`sphinx.builders.Builder.post_transform_merge_attr` to define a
      list of attributes to be returned to the main process after parallel
      post-transformation and writing. This data is passed to the builder method
      :py:meth:`sphinx.builders.Builder.merge_builder_post_transform`, which
      does the actual merging. If this is not sufficient for any of the active
      extensions, the feature flag cannot be used.
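
   A minimal sketch of a custom builder using this mechanism; the builder name
   and the ``my_data`` attribute are hypothetical examples, not part of Sphinx:

   .. code-block:: python

      from typing import Any

      from sphinx.builders import Builder


      class MyBuilder(Builder):
          name = 'mybuilder'  # hypothetical custom builder

          #: attributes pickled in each subprocess and sent back for merging;
          #: every listed attribute must be picklable
          post_transform_merge_attr = ('my_data',)

          def init(self) -> None:
              # filled during post-transformation / write_doc_serialized()
              self.my_data: dict[str, Any] = {}

          def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
              # called in the main process, once per finished subprocess
              self.my_data.update(new_attrs['my_data'])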

.. versionadded:: 7.3

.. note:: This configuration is still experimental.

.. _intl-options:

Options for internationalization
85 changes: 76 additions & 9 deletions sphinx/builders/__init__.py
@@ -75,6 +75,14 @@ class Builder:
supported_remote_images = False
#: The builder supports data URIs or not.
supported_data_uri_images = False
#: Builder attributes that should be returned from parallel
#: post-transformation, to be merged into the main builder in
#: :py:meth:`~sphinx.builders.Builder.merge_builder_post_transform`.
#: Attributes in the list must be picklable.
#: The approach improves performance when pickling and sending data
#: over pipes, because commonly only a subset of the builder attributes
#: is needed for merging into the main-process builder instance.
post_transform_merge_attr: tuple[str, ...] = ()

def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
self.srcdir = app.srcdir
@@ -125,6 +133,26 @@ def init(self) -> None:
"""
pass

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
    """Merge post-transform data into the main builder.

    This method allows builders to merge any post-transform information
    coming from parallel subprocesses back into the builder in
    the main process. This can be useful for extensions that consume
    that information in the build-finished phase.
    The function is called once for each finished subprocess.
    Builders that implement this function must also define the class
    attribute :py:attr:`~sphinx.builders.Builder.post_transform_merge_attr`,
    as it defines which builder attributes shall be returned to
    the main process for merging.

    The default implementation does nothing.

    :param new_attrs: The attributes from the parallel subprocess to be
        updated in the main builder
    """
    pass

Review comment (Member): Should it be a dict, or could you live with a Mapping (so that we really show that we do not act on the inputs)?

def create_template_bridge(self) -> None:
"""Return the template bridge configured."""
if self.config.template_bridge:
@@ -564,7 +592,10 @@ def write(

if self.parallel_ok:
# number of subprocesses is parallel-1 because the main process
# is busy loading doctrees and doing write_doc_serialized()
# is busy loading and post-transforming doctrees and doing
# write_doc_serialized();
# in case the global configuration enable_parallel_post_transform
# is active the main process does nothing
self._write_parallel(sorted(docnames),
nproc=self.app.parallel - 1)
else:
@@ -581,11 +612,34 @@ def _write_serial(self, docnames: Sequence[str]) -> None:
self.write_doc(docname, doctree)

def _write_parallel(self, docnames: Sequence[str], nproc: int) -> None:
def write_process(docs: list[tuple[str, nodes.document]]) -> None:
def write_process_serial_post_transform(
docs: list[tuple[str, nodes.document]],
) -> None:
self.app.phase = BuildPhase.WRITING
# The doctree has been post-transformed (incl. write_doc_serialized)
# in the main process, only write_doc() is needed here
for docname, doctree in docs:
self.write_doc(docname, doctree)

def write_process_parallel_post_transform(docs: list[str]) -> bytes:
assert self.env.config.enable_parallel_post_transform
self.app.phase = BuildPhase.WRITING
# run post-transform, doctree-resolved and write_doc_serialized in parallel
for docname in docs:
doctree = self.env.get_and_resolve_doctree(docname, self)
# write_doc_serialized is assumed to be safe for all Sphinx
# internal builders. Some builders merge information from post-transform
# and write_doc_serialized back to the main process using
# Builder.post_transform_merge_attr and
# Builder.merge_builder_post_transform
self.write_doc_serialized(docname, doctree)
self.write_doc(docname, doctree)
merge_attr = {
    attr: getattr(self, attr, None)
    for attr in self.post_transform_merge_attr
}
return pickle.dumps(merge_attr, pickle.HIGHEST_PROTOCOL)

Review comment (Member): Could you add (in the docstring or somewhere) that the attributes defined by post_transform_merge_attr should be picklable?

Review comment (chrisjsewell, Member, Mar 15, 2024): This is not a robust/flexible enough mechanism for merging data generated during post-processing:
  • It only accounts for data saved on to the Builder, not on to the Environment
  • It doesn't give any hook for extensions to merge data

Review comment (Member): There should be two clear "hook" points that "anyone" (builders, extensions, etc.) can use:
  1. A hook function/method where they can write logic to add data to the "entity" that will be pickled and returned from a single process
  2. A hook function/method where they can write logic to "merge" the pickled data back into the core process

Reply (ubmarco, Author, Apr 11, 2024): I understand the approach is not super flexible. There are however some reasons why I went for it:
  1. Pickling big objects can be quite expensive.
  2. The environment is written to disk earlier in the process IIRC. So what's the point of giving extensions an option to modify it?
  3. IIRC the built-in builders have attributes that are not picklable. Custom builders can easily have more of these problems.
In general I think the whole idea of sending huge objects through pipes for flexibility has too many downsides. It's slow, can break, is difficult to debug and hard to get right. I'm also worried about the implications when moving to spawn as the child start method (or going for libraries such as mpire).
I just went for the slim variant. Going forward it would make sense to offer an alternative approach to the whole multiprocessing mechanism, e.g. SQLite as a data store, for surgical modifications by extensions instead of ever-growing Python objects.

Review comment (Member): Should we really return None if the attribute does not exist? Shouldn't we raise an exception instead?

# warm up caches/compile templates using the first document
firstname, docnames = docnames[0], docnames[1:]
self.app.phase = BuildPhase.RESOLVING
@@ -598,21 +652,34 @@ def write_process(docs: list[tuple[str, nodes.document]]) -> None:
chunks = make_chunks(docnames, nproc)

# create a status_iterator to step progressbar after writing a document
# (see: ``on_chunk_done()`` function)
# (see: ``merge_builder()`` function)
progress = status_iterator(chunks, __('writing output... '), "darkgreen",
len(chunks), self.app.verbosity)

def on_chunk_done(args: list[tuple[str, NoneType]], result: NoneType) -> None:
next(progress)

def merge_builder(
    args: list[tuple[str, NoneType]], pickled_new_attrs: bytes, /,
) -> None:
    assert self.env.config.enable_parallel_post_transform
    new_attrs: dict[str, Any] = pickle.loads(pickled_new_attrs)
    self.merge_builder_post_transform(new_attrs)
    next(progress)

Review comment (Member): What is `args`? Why is the second item always None?

self.app.phase = BuildPhase.RESOLVING
for chunk in chunks:
arg = []
for docname in chunk:
doctree = self.env.get_and_resolve_doctree(docname, self)
self.write_doc_serialized(docname, doctree)
arg.append((docname, doctree))
tasks.add_task(write_process, arg, on_chunk_done)
if self.env.config.enable_parallel_post_transform:
tasks.add_task(write_process_parallel_post_transform,
chunk, merge_builder)
else:
arg = []
for docname in chunk:
doctree = self.env.get_and_resolve_doctree(docname, self)
self.write_doc_serialized(docname, doctree)
arg.append((docname, doctree))
tasks.add_task(write_process_serial_post_transform,
arg, on_chunk_done)

# make sure all threads have finished
tasks.join()
12 changes: 12 additions & 0 deletions sphinx/builders/_epub_base.py
@@ -155,6 +155,7 @@ class EpubBuilder(StandaloneHTMLBuilder):
refuri_re = REFURI_RE
template_dir = ""
doctype = ""
post_transform_merge_attr = ('images',)

def init(self) -> None:
super().init()
@@ -167,6 +168,17 @@ def init(self) -> None:
self.use_index = self.get_builder_config('use_index', 'epub')
self.refnodes: list[dict[str, Any]] = []

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
    """Merge images back to the main builder after parallel
    post-transformation.

    :param new_attrs: the attributes from the parallel subprocess to be
        udpated in the main builder (self)
    """
    for filepath, filename in new_attrs['images'].items():
        if filepath not in self.images:
            self.images[filepath] = filename

Review comment (Member): AFAICT, I think new_attrs is not meant to be writable, so a mapping should be fine. By the way, "new_attrs" is a bit confusing for me. Maybe "context" or "extras" could be a better name.

Suggested change (Contributor): fix the typo "udpated" to "updated" in the docstring.

def create_build_info(self) -> BuildInfo:
return BuildInfo(self.config, self.tags, frozenset({'html', 'epub'}))

30 changes: 29 additions & 1 deletion sphinx/builders/html/__init__.py
@@ -36,7 +36,7 @@
from sphinx.errors import ConfigError, ThemeError
from sphinx.highlighting import PygmentsBridge
from sphinx.locale import _, __
from sphinx.search import js_index
from sphinx.search import IndexBuilder, js_index
from sphinx.theming import HTMLThemeFactory
from sphinx.util import isurl, logging
from sphinx.util.display import progress_message, status_iterator
@@ -188,6 +188,7 @@ class StandaloneHTMLBuilder(Builder):

imgpath: str = ''
domain_indices: list[DOMAIN_INDEX_TYPE] = []
post_transform_merge_attr: tuple[str, ...] = ('images', 'indexer')

def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
super().__init__(app, env)
@@ -213,6 +214,7 @@ def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
op = pub.setup_option_parser(output_encoding='unicode', traceback=True)
pub.settings = op.get_default_values()
self._publisher = pub
self.indexer: IndexBuilder | None = None

def init(self) -> None:
self.build_info = self.create_build_info()
@@ -240,6 +242,32 @@ def init(self) -> None:

self.use_index = self.get_builder_config('use_index', 'html')

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
    """Merge images and search indexer back to the main builder after
    parallel post-transformation.

    :param new_attrs: the attributes from the parallel subprocess to be
        udpated in the main builder (self)
    """
    # handle indexer
    if self.indexer is None:
        lang = self.config.html_search_language or self.config.language
        self.indexer = IndexBuilder(self.env, lang,
                                    self.config.html_search_options,
                                    self.config.html_search_scorer)
    indexer_data = new_attrs['indexer']
    self.indexer._all_titles |= indexer_data._all_titles
    self.indexer._filenames |= indexer_data._filenames
    self.indexer._index_entries |= indexer_data._index_entries
    self.indexer._mapping |= indexer_data._mapping
    self.indexer._title_mapping |= indexer_data._title_mapping
    self.indexer._titles |= indexer_data._titles

    # handle images
    for filepath, filename in new_attrs['images'].items():
        if filepath not in self.images:
            self.images[filepath] = filename

Suggested change (Contributor): fix the typo "udpated" to "updated" in the docstring.

Review comment (Member): With this, I highly think we should use "context" instead of "new_attrs", because it's more like an "update" rather than replacing one attribute with another.

def create_build_info(self) -> BuildInfo:
return BuildInfo(self.config, self.tags, frozenset({'html'}))

13 changes: 13 additions & 0 deletions sphinx/builders/linkcheck.py
@@ -64,6 +64,8 @@ class CheckExternalLinksBuilder(DummyBuilder):
epilog = __('Look for any errors in the above output or in '
'%(outdir)s/output.txt')

post_transform_merge_attr = ('hyperlinks',)

def init(self) -> None:
self.broken_hyperlinks = 0
self.timed_out_hyperlinks = 0
@@ -80,6 +82,17 @@ def init(self) -> None:
)
warnings.warn(deprecation_msg, RemovedInSphinx80Warning, stacklevel=1)

def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
    """Merge hyperlinks back to the main builder after parallel
    post-transformation.

    :param new_attrs: the attributes from the parallel subprocess to be
        udpated in the main builder (self)
    """
    for hyperlink, value in new_attrs['hyperlinks'].items():
        if hyperlink not in self.hyperlinks:
            self.hyperlinks[hyperlink] = value

Suggested change (Contributor): fix the typo "udpated" to "updated" in the docstring.

def finish(self) -> None:
checker = HyperlinkAvailabilityChecker(self.config)
logger.info('')
1 change: 1 addition & 0 deletions sphinx/config.py
@@ -254,6 +254,7 @@ class Config:
'smartquotes_excludes': _Opt(
{'languages': ['ja'], 'builders': ['man', 'text']}, 'env', ()),
'option_emphasise_placeholders': _Opt(False, 'env', ()),
    'enable_parallel_post_transform': _Opt(False, 'html', ()),
}

Review comment (Member): Should we use 'html' or 'env'? (I always forget how it works.)

Review comment (Member): Currently, yes, it should be 'env'; ideally there would be a value to signify that all output formats should be rebuilt (whilst still using the cached doctrees and env), but this does not currently exist.

def __init__(self, config: dict[str, Any] | None = None,
9 changes: 9 additions & 0 deletions sphinx/search/__init__.py
@@ -553,3 +553,12 @@ def get_js_stemmer_code(self) -> str:
(base_js, language_js, self.lang.language_name))
else:
return self.lang.js_stemmer_code

def __getstate__(self):
    """Get the object's state.

    Return a copy of self.__dict__ (which contains all instance attributes),
    to avoid modifying the original state.
    """
    # remove env for performance reasons - it is not not needed by consumers
    return {k: v for k, v in self.__dict__.items() if k != 'env'}

Suggested change (Member): annotate the return type, i.e. ``def __getstate__(self) -> dict[str, Any]:`` (and import Any, if needed).

Suggested change (Member): fix the doubled word in the comment: "it is not not needed" should read "it is not needed".
5 changes: 4 additions & 1 deletion tests/test_versioning.py
@@ -17,7 +17,10 @@ def _setup_module(rootdir, sphinx_test_tempdir):
srcdir = sphinx_test_tempdir / 'test-versioning'
if not srcdir.exists():
shutil.copytree(rootdir / 'test-versioning', srcdir)
app = SphinxTestApp(srcdir=srcdir)
# parallelisation is not supported by this test case
# as the global variable 'doctrees' is not preserved
# when subprocesses finish
app = SphinxTestApp(srcdir=srcdir, parallel=0)

Review comment (Member): Revert it.

Reply (ubmarco, Author): I want to keep this change, as this specific test case really does not support parallel mode. Setting this explicitly avoids issues when running all test cases in parallel mode (which I did a lot of while testing this PR).

Review comment (Member): Ah, sorry for my "harsh" comment. Actually, it was because you had 'parallel=4' before (but now it's fine).

app.builder.env.app = app
app.connect('doctree-resolved', on_doctree_resolved)
app.build()