Skip to content

Conversation

ShaneHarvey
Copy link
Member

@ShaneHarvey ShaneHarvey marked this pull request as ready for review January 19, 2024 23:41
@ShaneHarvey ShaneHarvey requested a review from a team as a code owner January 19, 2024 23:41
@ShaneHarvey ShaneHarvey requested review from blink1073 and removed request for a team January 19, 2024 23:41
self._buffered_docs.append(
{"files_id": self._file["_id"], "n": self._chunk_number, "data": Binary(data)}
)
self._buffered_docs_size += len(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this include rest of the document size as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point. This won't matter much in practice because the default chunk_size (256KB) dominates the document overhead but I could see some pathological cases where the app uses chunk_size=1 or a very large "files_id" where this could behave poorly. In those cases we'd batch up far more documents in memory than we'd need.

I see two fixes that are simpler than guestimating the document size:

  • encode to RawBSONDocument so we know the real size
  • limit the batch to at most 100,000 documents.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided not to use RawBSONDocument as is was tricky to get right (which codec options to use), is less readable code-wise, and is a little less efficient (due to extra memory copies shuffling around the raw documents).

I did add the 100,000 limit (OP_MSG's maxWriteBatchSize) and added some extra overhead to account for the size of the chunk doc encoded as bson.

@ShaneHarvey ShaneHarvey requested a review from blink1073 January 20, 2024 03:37
Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ShaneHarvey ShaneHarvey merged commit 2fc4282 into mongodb:master Jan 22, 2024
@ShaneHarvey
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants