
Conversation

pawel-kmiecik
Contributor

This PR:

  • improves memory management by freeing memory early when processing larger files, instead of holding the data in memory for a long time
  • uses tempfiles to keep:
    • pdf_chunks for the split_pdf_page functionality
    • partial response elements as JSON files
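A minimal sketch of the tempfile approach described above: chunk bytes are written to per-operation temp files and only the paths are retained, so each chunk buffer can be garbage-collected right after it is written. The helper name and the list-of-bytes input are illustrative, not the SDK's actual internals.

```python
import tempfile
import uuid
from pathlib import Path


def write_chunks_to_tempdir(chunks, temp_dir):
    """Write each PDF chunk's bytes to its own file and return the paths.

    Holding only paths (not bytes) lets each chunk buffer be freed as
    soon as it is flushed to disk. `chunks` is assumed to be an iterable
    of bytes objects produced by the PDF splitter.
    """
    paths = []
    for chunk in chunks:
        path = Path(temp_dir) / f"{uuid.uuid4()}.pdf"
        path.write_bytes(chunk)
        paths.append(path)
    return paths


# Usage: one TemporaryDirectory per operation, cleaned up afterwards.
tmp = tempfile.TemporaryDirectory()
paths = write_chunks_to_tempdir([b"%PDF-1.4 chunk1", b"%PDF-1.4 chunk2"], tmp.name)
```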

@pawel-kmiecik pawel-kmiecik marked this pull request as draft October 30, 2024 12:32
# If we get 200, dump the contents to a file and return the path
temp_dir = self.tempdirs[operation_id]
if response.status_code == 200:
    temp_file_name = f"{temp_dir.name}/{uuid.uuid4()}.json"
Contributor

Is there a risk that a customer using the client doesn't have permission to write files from the process they're running?

Contributor Author

Good point. But at the same time, unstructured OS uses temp files heavily. Maybe making the temp dir parametrizable via an env variable could resolve this?
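The env-variable idea could look roughly like this. The variable name `UNSTRUCTURED_CLIENT_TMPDIR` is hypothetical; `tempfile.TemporaryDirectory` falls back to the platform default when `dir=None`.

```python
import os
import tempfile

# Hypothetical env var name; the real SDK may choose a different one.
TMPDIR_ENV_VAR = "UNSTRUCTURED_CLIENT_TMPDIR"


def make_operation_tempdir(operation_id: str) -> tempfile.TemporaryDirectory:
    """Create the per-operation temp dir, honoring an env override.

    If the env var is unset or empty, dir=None makes tempfile fall back
    to the platform default (tempfile.gettempdir()).
    """
    base_dir = os.environ.get(TMPDIR_ENV_VAR) or None
    return tempfile.TemporaryDirectory(
        prefix=f"unstructured_{operation_id}_", dir=base_dir
    )


tempdir = make_operation_tempdir("op-123")
```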

Contributor

Yeah, I'd guess it's safe for our own environments, but a customer could hit an error. Maybe we could fall back to using memory if we don't have write access (not sure on the implementation there), or we could expose a path in the client that the caller sets up explicitly for temporary storage?
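The fall-back-to-memory idea could be sketched like this: attempt to open a scratch file on disk and, if the process lacks write access, return an in-memory buffer instead. The function name is illustrative; the trade-off is that the in-memory path loses the memory savings but keeps the client working.

```python
import io
import os
import tempfile
import uuid


def open_scratch_file(temp_dir: str):
    """Return a writable binary file-like object, preferring disk.

    If the process cannot write to temp_dir (e.g. a locked-down
    container), fall back to an in-memory buffer so the client still
    works, just without the memory savings.
    """
    try:
        path = os.path.join(temp_dir, f"{uuid.uuid4()}.json")
        return open(path, "w+b")
    except OSError:
        return io.BytesIO()
```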

Contributor Author

I've added a feature flag in the partition parameters (the feature is disabled by default).
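In outline, a disabled-by-default flag on the partition parameters might look like the sketch below. The field name `cache_to_disk` is hypothetical; the SDK's actual parameter name is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class PartitionParameters:
    """Illustrative subset of partition parameters.

    cache_to_disk is a hypothetical name for the tempfile-caching
    feature flag; it defaults to False so opt-in is explicit.
    """
    split_pdf_page: bool = True
    cache_to_disk: bool = False


params = PartitionParameters()
```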

@pawel-kmiecik
Contributor Author

FYI: Added more optimizations; on a 1k-page PDF the memory usage dropped from ~250MB to ~120MB (<100MB if not reading the contents of the file).
Next steps include a feature flag for the memory optimizations that use tmp-file caching (as @jordan-homan mentioned, it might be a problem for some users).

@pawel-kmiecik pawel-kmiecik marked this pull request as ready for review November 5, 2024 15:06
@jordan-homan
Contributor

before merging we should squash into one commit so that it's easy to follow in the public git history

@pawel-kmiecik
Contributor Author

pawel-kmiecik commented Nov 7, 2024

before merging we should squash into one commit so that it's easy to follow in the public git history

Isn't that the default setting? I'm not sure what the squash commit message setting in this repo is (it should be the PR description, I guess).
[Screenshot: repo merge settings, 2024-11-07 15:28]

EDIT: It looks like it takes the PR title and description, indeed.
[Screenshot: squash commit message preview, 2024-11-07 15:30]

self.coroutines_to_execute.pop(operation_id, None)
self.api_successful_responses.pop(operation_id, None)
tempdir = self.tempdirs.pop(operation_id, None)
if tempdir:
Contributor

Are there other areas we should consider cleaning up, or does _clear_operation cover every case? Curious if there are scenarios where, if the API call fails, we don't clean up and leave files in the directory.

Contributor Author

That's a limitation of the SDK... cleanup isn't supposed to fail, on the condition that all possible exceptions are caught (and it looks like we catch them, but it's not fireproof :D).
It would be much cleaner and more resilient if we could use a nice context manager, but because of the hooks we hack here, we cannot.
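Without a context manager, the closest guarantee is a `try`/`finally` around the operation so the temp dir is removed on both success and failure. The class and method names below are illustrative, not the SDK's; the point is only that `finally` runs even when the wrapped call raises.

```python
import tempfile


class OperationTempdirs:
    """Minimal sketch: guarantee temp-dir cleanup without a context
    manager, mirroring a hook-based flow where `with` isn't available.
    Names here are illustrative, not the SDK's."""

    def __init__(self):
        self.tempdirs = {}

    def run(self, operation_id, func):
        self.tempdirs[operation_id] = tempfile.TemporaryDirectory()
        try:
            return func(self.tempdirs[operation_id].name)
        finally:
            # Runs on success *and* on any exception raised by func.
            tempdir = self.tempdirs.pop(operation_id, None)
            if tempdir:
                tempdir.cleanup()


ops = OperationTempdirs()
```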

)
# force free PDF object memory
del pdf
pdf_chunks = self._get_pdf_chunk_files(pdf_chunk_paths)
Contributor

@jordan-homan jordan-homan Nov 7, 2024

Question in general about the new process: how do we improve memory management in general?

So for example, if I have a 5000-page PDF that consumes ~50MB of memory per chunk (let's say this process breaks it down into 100 batches of 50 pages), I understand that when batch 1 of 50MB completes, it will write that to a file, batch 2 writes to a file, and so on...

so if the job takes 20 minutes to complete, the overall memory usage stays low while we write to files.

However, at the end of the 20-minute time frame, once the 100th batch finally completes, we then take all 100 batches written to files and bring everything back into memory at that point to return it to the caller, right? Are we optimizing purely for the duration of the run to keep memory low, and not concerned that, at the end, memory needs to go back up to ultimately return the desired output?

Contributor Author

Yes, you're right: we don't want to occupy the memory during a long process (and with hi_res it can be tens or hundreds of minutes). But we cannot avoid loading the JSON files back into memory, because the interface must work the same way as for non-split-pdf calls, i.e. return a Response with JSON attached. That's why we won't avoid the spike at the end.

Maybe we could replace the response read operation with something smart that loads the files one by one, but I don't think we'd gain much there.

Also, what's important in these changes: we no longer pass bytes or an io.BytesIO object to the request, but a file object, which is read to the socket in a buffered way.
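The buffered-read point above can be illustrated in isolation: when an HTTP client is handed an open file object, it copies it to the socket in fixed-size buffers, so only one buffer is resident at a time instead of the whole payload. The helper below is a sketch of that copy loop; `sock_write` stands in for the socket's send function, and the 64 KiB chunk size is an assumption, not the SDK's actual value.

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed buffer size, not the SDK's actual value


def stream_to_socket(fileobj, sock_write):
    """Copy fileobj to a socket-like writer in fixed-size buffers.

    This is what passing an open file object to an HTTP client buys us:
    only CHUNK_SIZE bytes are resident at a time, instead of the whole
    payload being held in memory as one bytes object.
    """
    total = 0
    while True:
        buf = fileobj.read(CHUNK_SIZE)
        if not buf:  # EOF
            return total
        sock_write(buf)
        total += len(buf)


# Usage: a 200 KB payload goes out in four buffered writes.
sent = []
n = stream_to_socket(io.BytesIO(b"x" * 200_000), sent.append)
```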

@jordan-homan
Contributor

jordan-homan commented Nov 7, 2024

Isn't it the default setting? I'm not sure what's the squash commit message setting in this repo (It should be the PR

sounds good!

@pawel-kmiecik pawel-kmiecik merged commit f7c1c94 into main Nov 7, 2024
13 checks passed
@pawel-kmiecik pawel-kmiecik deleted the pawel/fix-split-pdf-memory-usage branch November 7, 2024 15:27