Parallel post transform including write_doc_serialized #11746
base: master
Changes from all commits
14c6a15
5d887b7
b3a5778
3b42a05
d9b2de7
69a77c7
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -75,6 +75,14 @@ class Builder:
    supported_remote_images = False
    #: The builder supports data URIs or not.
    supported_data_uri_images = False
    #: Builder attributes that should be returned from parallel
    #: post transformation, to be merged into the main builder in
    #: :py:meth:`~sphinx.builders.Builder.merge_builder_post_transform`.
    #: Attributes in the list must be pickleable.
    #: The approach improves performance when pickling and sending data
    #: over pipes because only a subset of the builder attributes
    #: is commonly needed for merging into the main process builder instance.
    post_transform_merge_attr: tuple[str, ...] = ()

    def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
        self.srcdir = app.srcdir
@@ -125,6 +133,26 @@ def init(self) -> None:
        """
        pass

    def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
        """Merge post-transform data into the main builder.

        This method allows builders to merge any post-transform information
        coming from parallel subprocesses back into the builder in
        the main process. This can be useful for extensions that consume
        that information in the build-finished phase.
        The function is called once for each finished subprocess.
        Builders that implement this function must also define the class
        attribute :py:attr:`~sphinx.builders.Builder.post_transform_merge_attr`,
        as it defines which builder attributes shall be returned to
        the main process for merging.

        The default implementation does nothing.

        :param new_attrs: The attributes from the parallel subprocess to be
            updated in the main builder
        """
        pass

    def create_template_bridge(self) -> None:
        """Return the template bridge configured."""
        if self.config.template_bridge:
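The base-class contract above can be illustrated with a plain-Python sketch. Everything below is invented for illustration (there is no `DummyImageBuilder` in Sphinx); only the `post_transform_merge_attr` / `merge_builder_post_transform` pairing mirrors the API added by this PR:

```python
# Hypothetical sketch of the merge contract added by this PR.
class DummyImageBuilder:
    #: pickleable attributes each subprocess sends back for merging
    post_transform_merge_attr: tuple[str, ...] = ('images',)

    def __init__(self) -> None:
        self.images: dict[str, str] = {}

    def merge_builder_post_transform(self, new_attrs: dict) -> None:
        # first mapping wins, mirroring the epub/linkcheck builders in the diff
        for filepath, filename in new_attrs['images'].items():
            if filepath not in self.images:
                self.images[filepath] = filename


main = DummyImageBuilder()
# the diff calls this once per finished subprocess
main.merge_builder_post_transform({'images': {'a.png': 'img0.png'}})
main.merge_builder_post_transform(
    {'images': {'a.png': 'other.png', 'b.png': 'img1.png'}})
```

After both calls, `a.png` keeps its first mapping and `b.png` is added.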
@@ -564,7 +592,10 @@ def write(

        if self.parallel_ok:
            # number of subprocesses is parallel-1 because the main process
            # is busy loading and post-transforming doctrees and doing
            # write_doc_serialized();
            # in case the global configuration enable_parallel_post_transform
            # is active the main process does nothing
            self._write_parallel(sorted(docnames),
                                 nproc=self.app.parallel - 1)
        else:
@@ -581,11 +612,34 @@ def _write_serial(self, docnames: Sequence[str]) -> None:
        self.write_doc(docname, doctree)

    def _write_parallel(self, docnames: Sequence[str], nproc: int) -> None:
        def write_process_serial_post_transform(
            docs: list[tuple[str, nodes.document]],
        ) -> None:
            self.app.phase = BuildPhase.WRITING
            # The doctree has been post-transformed (incl. write_doc_serialized)
            # in the main process, only write_doc() is needed here
            for docname, doctree in docs:
                self.write_doc(docname, doctree)

        def write_process_parallel_post_transform(docs: list[str]) -> bytes:
            assert self.env.config.enable_parallel_post_transform
            self.app.phase = BuildPhase.WRITING
            # run post-transform, doctree-resolved and write_doc_serialized in parallel
            for docname in docs:
                doctree = self.env.get_and_resolve_doctree(docname, self)
                # write_doc_serialized is assumed to be safe for all Sphinx
                # internal builders. Some builders merge information from post-transform
                # and write_doc_serialized back to the main process using
                # Builder.post_transform_merge_attr and
                # Builder.merge_builder_post_transform
                self.write_doc_serialized(docname, doctree)
                self.write_doc(docname, doctree)
            merge_attr = {
                attr: getattr(self, attr, None)
                for attr in self.post_transform_merge_attr
            }
            return pickle.dumps(merge_attr, pickle.HIGHEST_PROTOCOL)

        # warm up caches/compile templates using the first document
        firstname, docnames = docnames[0], docnames[1:]
        self.app.phase = BuildPhase.RESOLVING

Review discussion on the `merge_attr` dictionary:

Reviewer: Could you add (in the docstring or somewhere) that the attributes defined by …

Reviewer: This is not a robust/flexible enough mechanism for merging data generated during post-processing: there should be two clear "hook" points that "anyone" (builders, extensions, etc.) can use: …

Author: I understand the approach is not super flexible. There are however some reasons why I went for it. In general I think the whole idea of sending huge objects through pipes for flexibility has too many downsides: it's slow, can break, is difficult to debug and hard to get right. I'm also worried about the implications when moving to spawn as the child start method (or going for libraries such as mpire). I just went for the slim variant. Going forward it would make sense to offer an alternative approach to the whole multiprocessing mechanism, e.g. SQLite as a data store for surgical modifications by extensions instead of ever-growing Python objects.

Reviewer (on `getattr(self, attr, None)`): Should we really return …
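The `merge_attr` dictionary built at the end of `write_process_parallel_post_transform` is a plain pickle round-trip over a pipe. A minimal self-contained sketch of that round-trip (the `_FakeBuilder` class is invented; the dict comprehension and `pickle` calls mirror the diff):

```python
import pickle


class _FakeBuilder:
    post_transform_merge_attr = ('images', 'hyperlinks')

    def __init__(self) -> None:
        self.images = {'a.png': 'img0.png'}
        # 'hyperlinks' is deliberately left unset to show the
        # getattr(..., None) fallback used in the diff


b = _FakeBuilder()
merge_attr = {
    attr: getattr(b, attr, None)
    for attr in b.post_transform_merge_attr
}
# what the subprocess sends back over the pipe
payload = pickle.dumps(merge_attr, pickle.HIGHEST_PROTOCOL)
# what the main process unpacks before calling merge_builder_post_transform
restored = pickle.loads(payload)
```

Note how a missing attribute arrives in the main process as `None`, which is exactly the case the reviewer questions above.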
@@ -598,21 +652,34 @@ def write_process(docs: list[tuple[str, nodes.document]]) -> None:
        chunks = make_chunks(docnames, nproc)

        # create a status_iterator to step progressbar after writing a document
        # (see: ``merge_builder()`` function)
        progress = status_iterator(chunks, __('writing output... '), "darkgreen",
                                   len(chunks), self.app.verbosity)

        def on_chunk_done(args: list[tuple[str, NoneType]], result: NoneType) -> None:
            next(progress)

        def merge_builder(
            args: list[tuple[str, NoneType]], pickled_new_attrs: bytes, /,
        ) -> None:
            assert self.env.config.enable_parallel_post_transform
            new_attrs: dict[str, Any] = pickle.loads(pickled_new_attrs)
            self.merge_builder_post_transform(new_attrs)
            next(progress)

        self.app.phase = BuildPhase.RESOLVING
        for chunk in chunks:
            if self.env.config.enable_parallel_post_transform:
                tasks.add_task(write_process_parallel_post_transform,
                               chunk, merge_builder)
            else:
                arg = []
                for docname in chunk:
                    doctree = self.env.get_and_resolve_doctree(docname, self)
                    self.write_doc_serialized(docname, doctree)
                    arg.append((docname, doctree))
                tasks.add_task(write_process_serial_post_transform,
                               arg, on_chunk_done)

        # make sure all threads have finished
        tasks.join()

Reviewer (on the ``merge_builder`` signature): What is the args? Why is the second item always …
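The scheduling above hands each subprocess one chunk of docnames. A rough stand-in for the chunking (this is not Sphinx's actual `make_chunks`, whose batching may differ; it only illustrates the even-split idea):

```python
def split_into_chunks(items: list[str], nproc: int) -> list[list[str]]:
    # naive even split of work items across worker processes
    chunksize = max(1, len(items) // nproc)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


# 10 hypothetical docnames split for 3 workers -> batches of 3, 3, 3, 1
chunks = split_into_chunks([f'doc{i}' for i in range(10)], 3)
```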
@@ -155,6 +155,7 @@ class EpubBuilder(StandaloneHTMLBuilder):
    refuri_re = REFURI_RE
    template_dir = ""
    doctype = ""
    post_transform_merge_attr = ('images',)

    def init(self) -> None:
        super().init()
@@ -167,6 +168,17 @@ def init(self) -> None:
        self.use_index = self.get_builder_config('use_index', 'epub')
        self.refnodes: list[dict[str, Any]] = []

    def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
        """Merge images back to the main builder after parallel
        post-transformation.

        :param new_attrs: the attributes from the parallel subprocess to be
            updated in the main builder (self)
        """
        for filepath, filename in new_attrs['images'].items():
            if filepath not in self.images:
                self.images[filepath] = filename

    def create_build_info(self) -> BuildInfo:
        return BuildInfo(self.config, self.tags, frozenset({'html', 'epub'}))

Reviewer: AFAICT, new_attrs are not meant to be writable, so a Mapping should be fine. By the way, "new_attrs" is a bit confusing for me. Maybe "context" or "extras" could be a better name.
@@ -36,7 +36,7 @@
    from sphinx.errors import ConfigError, ThemeError
    from sphinx.highlighting import PygmentsBridge
    from sphinx.locale import _, __
    from sphinx.search import IndexBuilder, js_index
    from sphinx.theming import HTMLThemeFactory
    from sphinx.util import isurl, logging
    from sphinx.util.display import progress_message, status_iterator

@@ -188,6 +188,7 @@ class StandaloneHTMLBuilder(Builder):

    imgpath: str = ''
    domain_indices: list[DOMAIN_INDEX_TYPE] = []
    post_transform_merge_attr: tuple[str, ...] = ('images', 'indexer')

    def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
        super().__init__(app, env)

@@ -213,6 +214,7 @@ def __init__(self, app: Sphinx, env: BuildEnvironment) -> None:
        op = pub.setup_option_parser(output_encoding='unicode', traceback=True)
        pub.settings = op.get_default_values()
        self._publisher = pub
        self.indexer: IndexBuilder | None = None

    def init(self) -> None:
        self.build_info = self.create_build_info()
@@ -240,6 +242,32 @@ def init(self) -> None:

        self.use_index = self.get_builder_config('use_index', 'html')

    def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
        """Merge images and search indexer back to the main builder after parallel
        post-transformation.

        :param new_attrs: the attributes from the parallel subprocess to be
            updated in the main builder (self)
        """
        # handle indexer
        if self.indexer is None:
            lang = self.config.html_search_language or self.config.language
            self.indexer = IndexBuilder(self.env, lang,
                                        self.config.html_search_options,
                                        self.config.html_search_scorer)
        indexer_data = new_attrs['indexer']
        self.indexer._all_titles |= indexer_data._all_titles
        self.indexer._filenames |= indexer_data._filenames
        self.indexer._index_entries |= indexer_data._index_entries
        self.indexer._mapping |= indexer_data._mapping
        self.indexer._title_mapping |= indexer_data._title_mapping
        self.indexer._titles |= indexer_data._titles

        # handle images
        for filepath, filename in new_attrs['images'].items():
            if filepath not in self.images:
                self.images[filepath] = filename

    def create_build_info(self) -> BuildInfo:
        return BuildInfo(self.config, self.tags, frozenset({'html'}))

Reviewer (on ``indexer_data = new_attrs['indexer']``): With this, I highly think we should use "context" instead of "new_attrs", because it's more like an "update" rather than replacing one attribute with another.
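The indexer merge above leans on the in-place dict union operator (`|=`, Python 3.9+), where keys from the right-hand side, i.e. the subprocess data, overwrite existing ones. A minimal illustration with invented sample data:

```python
# main-process and subprocess title mappings (invented sample data,
# standing in for IndexBuilder._titles and friends)
main_titles = {'index': 'Welcome', 'usage': 'Old Usage Title'}
sub_titles = {'usage': 'Usage', 'api': 'API Reference'}

# |= updates main_titles in place; the right-hand side wins on duplicate keys
main_titles |= sub_titles
```

This "last subprocess wins" semantics is fine here because each docname is processed by exactly one subprocess, so the mappings rarely collide on real keys.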
@@ -64,6 +64,8 @@ class CheckExternalLinksBuilder(DummyBuilder):
    epilog = __('Look for any errors in the above output or in '
                '%(outdir)s/output.txt')

    post_transform_merge_attr = ('hyperlinks',)

    def init(self) -> None:
        self.broken_hyperlinks = 0
        self.timed_out_hyperlinks = 0

@@ -80,6 +82,17 @@ def init(self) -> None:
        )
        warnings.warn(deprecation_msg, RemovedInSphinx80Warning, stacklevel=1)
    def merge_builder_post_transform(self, new_attrs: dict[str, Any]) -> None:
        """Merge hyperlinks back to the main builder after parallel
        post-transformation.

        :param new_attrs: the attributes from the parallel subprocess to be
            updated in the main builder (self)
        """
        for hyperlink, value in new_attrs['hyperlinks'].items():
            if hyperlink not in self.hyperlinks:
                self.hyperlinks[hyperlink] = value

    def finish(self) -> None:
        checker = HyperlinkAvailabilityChecker(self.config)
        logger.info('')
@@ -254,6 +254,7 @@ class Config:
    'smartquotes_excludes': _Opt(
        {'languages': ['ja'], 'builders': ['man', 'text']}, 'env', ()),
    'option_emphasise_placeholders': _Opt(False, 'env', ()),
    'enable_parallel_post_transform': _Opt(False, 'html', ()),
}

    def __init__(self, config: dict[str, Any] | None = None,

Reviewer: Should we use 'html' or 'env'? (I always forget how it works.)

Author: Currently yeah, it should be …
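As registered in the diff, the option defaults to `False`, so the parallel post-transform path is strictly opt-in from a project's `conf.py` (the option name is taken from the diff; whether it survives review unchanged is not guaranteed):

```python
# conf.py -- opt in to the parallel post-transform behaviour added by this PR
enable_parallel_post_transform = True
```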
@@ -553,3 +553,12 @@ def get_js_stemmer_code(self) -> str:
            (base_js, language_js, self.lang.language_name))
        else:
            return self.lang.js_stemmer_code

    def __getstate__(self):
        """Get the object's state.

        Return a copy of self.__dict__ (which contains all instance attributes),
        to avoid modifying the original state.
        """
        # remove env for performance reasons - it is not needed by consumers
        return {k: v for k, v in self.__dict__.items() if k != 'env'}

Reviewer (on the ``__getstate__`` signature): Suggested change … (and import Any, if needed)
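`__getstate__` is the standard pickle hook for trimming an object's state before serialization; here it lets the indexer cross the pipe without dragging the heavy environment along. A self-contained sketch of the same pattern (the `_FakeIndexer` class is invented, standing in for `IndexBuilder`):

```python
import pickle


class _FakeIndexer:
    def __init__(self) -> None:
        self.env = object()          # stand-in for the heavy BuildEnvironment
        self._titles = {'index': 'Welcome'}

    def __getstate__(self):
        # drop 'env' from the pickled state, as the diff does for IndexBuilder
        return {k: v for k, v in self.__dict__.items() if k != 'env'}


# round-trip: the unpickled copy simply lacks the 'env' attribute
restored = pickle.loads(pickle.dumps(_FakeIndexer()))
```

On unpickling, `__dict__` is populated from the returned state, so consumers in the main process see every attribute except `env`.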
@@ -17,7 +17,10 @@ def _setup_module(rootdir, sphinx_test_tempdir):
    srcdir = sphinx_test_tempdir / 'test-versioning'
    if not srcdir.exists():
        shutil.copytree(rootdir / 'test-versioning', srcdir)
    # parallelisation is not supported by this test case
    # as the global variable 'doctrees' is not preserved
    # when subprocesses finish
    app = SphinxTestApp(srcdir=srcdir, parallel=0)
    app.builder.env.app = app
    app.connect('doctree-resolved', on_doctree_resolved)
    app.build()

Reviewer: Revert it.

Author: I want to keep this change as this specific test case really does not support parallel mode.

Reviewer: Ah sorry for my "harsh" comment. Actually, it was because you had 'parallel=4' before (but now it's fine).
Reviewer: Should it be a dict or could you live with a Mapping (so that we really show that we do not act on the inputs)?