feat: improved memory management during long split pdf processing #202
Conversation
# If we get 200, dump the contents to a file and return the path
temp_dir = self.tempdirs[operation_id]
if response.status_code == 200:
    temp_file_name = f"{temp_dir.name}/{uuid.uuid4()}.json"
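For context, a minimal sketch of the pattern this diff implements, assuming a per-operation tempfile.TemporaryDirectory is already stored in self.tempdirs; the helper name _store_response_to_disk is made up for illustration:

import os
import uuid

def _store_response_to_disk(self, operation_id: str, response) -> str:
    """Write a successful response body into the operation's temp dir and return the file path."""
    temp_dir = self.tempdirs[operation_id]  # tempfile.TemporaryDirectory created for this split operation
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    temp_file_path = os.path.join(temp_dir.name, f"{uuid.uuid4()}.json")
    with open(temp_file_path, "wb") as f:
        f.write(response.content)  # persist the JSON so the response body can be dropped from memory
    return temp_file_path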
Is there a risk that a customer using the client doesn't have permission to write files in the environment their process is running in?
Good point. At the same time, the unstructured open-source library uses temp files heavily. Maybe making the temp dir configurable via an environment variable could resolve this?
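A minimal sketch of what that env-based configuration could look like; the variable name UNSTRUCTURED_CLIENT_TMPDIR is hypothetical, not an existing setting:

import os
import tempfile

def _create_split_tempdir() -> tempfile.TemporaryDirectory:
    """Create the per-operation temp dir, honoring an optional override from the environment."""
    # When the (hypothetical) env var is unset, dir=None falls back to the platform default (e.g. /tmp).
    base_dir = os.environ.get("UNSTRUCTURED_CLIENT_TMPDIR") or None
    return tempfile.TemporaryDirectory(dir=base_dir)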
Yeah, I'd guess it's safe in our own environments, but a customer could hit an error. Maybe we could fall back to keeping the data in memory if we don't have write access (not sure about the implementation there), or we could expose a path in the client that the caller sets up explicitly for temporary storage?
I've added a feature flag in the partition parameters (the feature is disabled by default).
FYI: I added more optimizations, and on a 1k PDF file the memory usage dropped from ~250 MB to ~120 MB (<100 MB if not reading the contents of the file).
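Roughly how a disabled-by-default flag in the partition parameters could gate the new behavior; the parameter name split_pdf_cache_tmp_data below is hypothetical rather than the exact one added in this PR:

def _caching_enabled(form_data: dict) -> bool:
    """Check the opt-in flag in the partition parameters (hypothetical name; off by default)."""
    value = str(form_data.get("split_pdf_cache_tmp_data", "false"))
    return value.strip().lower() == "true"

When the flag is off, chunk responses would stay in memory exactly as before, so existing callers see no behavior change.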
Before merging we should squash into one commit so that it's easy to follow in the public git history.
self.coroutines_to_execute.pop(operation_id, None)
self.api_successful_responses.pop(operation_id, None)
tempdir = self.tempdirs.pop(operation_id, None)
if tempdir:
Are there other areas we should consider cleaning up, or does _clear_operation cover every case? Curious whether there are scenarios where the API call fails, we don't clean up, and files are left in the directory.
That's a limitation of the SDK... it isn't supposed to fail to clean up, on the condition that all possible exceptions are caught (and it looks like we do catch them, but it's not foolproof :D).
It would be much cleaner and more resilient if we could use a context manager, but because of the hooks we rely on here, we cannot.
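For comparison, a sketch of the context-manager shape that the hook architecture rules out here; the function and its inputs are illustrative:

import json
import os
import tempfile
import uuid

def buffer_chunks_to_disk(chunk_payloads: list) -> None:
    """Write each chunk's JSON payload into a temp dir that is removed when the
    block exits, even if an exception is raised, so no manual cleanup step is needed."""
    with tempfile.TemporaryDirectory() as temp_dir:
        for payload in chunk_payloads:
            path = os.path.join(temp_dir, f"{uuid.uuid4()}.json")
            with open(path, "w") as f:
                json.dump(payload, f)
        # ... process the buffered files here before the directory disappears ...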
)
# force free PDF object memory
del pdf
pdf_chunks = self._get_pdf_chunk_files(pdf_chunk_paths)
A general question about the new process: how does it improve memory management overall?
For example, say I have a 5000-page PDF that consumes ~50 MB of memory per chunk, and the process breaks it into 100 batches of 50 pages. I understand that when batch 1 (~50 MB) completes, it's written to a file, then batch 2 is written to a file, and so on. So if the job takes 20 minutes to complete, the overall memory usage stays low while we write to files.
However, at the end of that 20-minute window, once the 100th batch finally completes, we take all 100 batches written to files and bring everything back into memory to return it to the caller, right? Are we optimizing purely to keep memory low for the duration of the run, and not concerned that at the end memory has to go back up to return the desired output?
Yes, you're right: we don't want to occupy the memory during a long process (and with hi_res it can be tens or hundreds of minutes). But we cannot avoid loading the JSON files back into memory, because the interface must work the same way as for non-split-pdf calls, i.e. return a Response with the JSON attached. That's why we can't avoid the spike at the end.
Maybe we could replace the response read operation with something smarter that loads the files one by one, but I don't think we'd gain much there.
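A rough illustration of that end-of-run step, assuming each chunk's response body is a JSON array of elements stored in its own temp file (the function name and file layout are illustrative):

import json

def merge_chunk_files(json_paths: list) -> bytes:
    """Read every per-chunk JSON file and build the single body the caller expects
    on the Response, which necessarily brings all elements back into memory at once."""
    all_elements = []
    for path in json_paths:
        with open(path, "rb") as f:
            all_elements.extend(json.load(f))  # each file holds a JSON array of element dicts
    return json.dumps(all_elements).encode("utf-8")  # serialized again to attach to the Response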
Also, an important part of these changes: we no longer pass bytes or an io.BytesIO object to the request, but a file object, which is read to the socket in a buffered way.
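To illustrate that last point, a sketch using httpx (the SDK's actual HTTP stack and field names may differ): passing the open file object, rather than pdf_file.read() or an io.BytesIO, lets the library read the body from disk in buffered pieces while writing to the socket.

import httpx

def send_pdf_chunk(url: str, chunk_path: str) -> httpx.Response:
    """Upload one PDF chunk as a multipart file field streamed from disk."""
    with open(chunk_path, "rb") as pdf_file:
        # The whole chunk never has to sit in memory as a bytes object.
        files = {"files": ("chunk.pdf", pdf_file, "application/pdf")}
        return httpx.post(url, files=files)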
Sounds good!
This PR: