PYTHON-4146 Improve GridFS upload performance by batch writing chunks with insert_many #1478
…unks with insert_many
gridfs/grid_file.py
Outdated
self._buffered_docs.append(
    {"files_id": self._file["_id"], "n": self._chunk_number, "data": Binary(data)}
)
self._buffered_docs_size += len(data)
Should this include rest of the document size as well?
Hmm, good point. This won't matter much in practice because the default chunk_size (256KB) dominates the document overhead, but I could see some pathological cases (an app using chunk_size=1 or a very large "files_id") where this could behave poorly. In those cases we'd batch up far more documents in memory than we'd need.
I see two fixes that are simpler than guesstimating the document size:
- encode to RawBSONDocument so we know the real size
- limit the batch to at most 100,000 documents.
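For a sense of scale, the fixed per-chunk overhead can be worked out from the BSON layout. This is only a rough sketch: it assumes "files_id" is a 12-byte ObjectId, "n" is encoded as an int32, and an ObjectId "_id" is added on insert. `chunk_doc_size` is a hypothetical helper, not part of PyMongo.

```python
def chunk_doc_size(data_len: int, with_id: bool = True) -> int:
    """Estimate the BSON-encoded size of a GridFS chunk document.

    Assumes an ObjectId "files_id", an int32 "n", and a Binary "data" field.
    """
    size = 4 + 1  # int32 document length prefix + trailing NUL terminator
    size += 1 + len("files_id") + 1 + 12  # type byte, key, NUL, ObjectId
    size += 1 + len("n") + 1 + 4  # type byte, key, NUL, int32 value
    # Binary: type byte, key, NUL, int32 payload length, subtype byte, payload
    size += 1 + len("data") + 1 + 4 + 1 + data_len
    if with_id:
        size += 1 + len("_id") + 1 + 12  # assumed driver-generated ObjectId
    return size


# Share of each document that is overhead rather than chunk data.
for data_len in (1, 256 * 1024):
    total = chunk_doc_size(data_len)
    print(data_len, total, f"{(total - data_len) / total:.4%}")
```

Under these assumptions the overhead is a few dozen bytes per chunk: negligible next to a chunk of roughly 256KB, but almost the entire document when chunk_size=1, which is why counting only len(data) undercounts so badly in that case.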
I decided not to use RawBSONDocument as it was tricky to get right (which codec options to use), is less readable code-wise, and is a little less efficient (due to extra memory copies shuffling around the raw documents).
I did add the 100,000 limit (OP_MSG's maxWriteBatchSize) and added some extra overhead to account for the size of the chunk doc encoded as BSON.
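The resulting flush logic might look roughly like the sketch below. It is not the code from this PR: `ChunkBatcher`, `FakeCollection`, `_CHUNK_OVERHEAD`, and `_MAX_BATCH_BYTES` are hypothetical names and values chosen for illustration. Only the 100,000-document cap comes from the discussion above.

```python
_MAX_BATCH_DOCS = 100_000  # maxWriteBatchSize cap mentioned in the review
_MAX_BATCH_BYTES = 16 * 1024 * 1024  # assumed byte budget, illustrative only
_CHUNK_OVERHEAD = 60  # assumed fixed BSON overhead per chunk doc, illustrative


class FakeCollection:
    """Stand-in for a chunks collection; records each insert_many call."""

    def __init__(self):
        self.batches = []

    def insert_many(self, docs):
        self.batches.append(list(docs))


class ChunkBatcher:
    """Buffers chunk docs and flushes on a document-count or byte limit."""

    def __init__(self, coll, files_id,
                 max_docs=_MAX_BATCH_DOCS, max_bytes=_MAX_BATCH_BYTES):
        self._coll = coll
        self._files_id = files_id
        self._max_docs = max_docs
        self._max_bytes = max_bytes
        self._docs = []
        self._size = 0
        self._n = 0

    def add(self, data: bytes) -> None:
        self._docs.append({"files_id": self._files_id, "n": self._n, "data": data})
        self._n += 1
        # Count the payload plus an estimate of the rest of the document.
        self._size += len(data) + _CHUNK_OVERHEAD
        if len(self._docs) >= self._max_docs or self._size >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        if self._docs:
            self._coll.insert_many(self._docs)
            self._docs = []
            self._size = 0


coll = FakeCollection()
batcher = ChunkBatcher(coll, files_id="example", max_docs=3)
for _ in range(7):
    batcher.add(b"x")
batcher.flush()  # write whatever is still buffered
print([len(batch) for batch in coll.batches])
```

With tiny chunks the document-count cap triggers the flush long before the byte budget does, which is the pathological case raised earlier in the thread.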
LGTM!
This results in a 42% perf increase for the GridFsUpload benchmark: https://spruce.mongodb.com/task/mongo_python_driver_perf_tests_perf_6.0_standalone_patch_b8d6bfdf085081ccf509324dfcda7ea51434eae0_65ab3f1f0305b9c51bb8a7a3_24_01_20_03_33_52/trend-charts?execution=0&sortBy=STATUS&sortDir=ASC
https://jira.mongodb.org/browse/PYTHON-4146