
Conversation

pawel-kmiecik
Contributor

This PR:

  • improves memory management by freeing memory early when processing larger files, instead of holding the data in memory for a long time
  • uses tempfiles to keep:
    • pdf_chunks for the split_pdf_page functionality
    • partial response elements as JSON files
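A minimal sketch of the tempfile approach described above: chunk bytes are written to per-operation temp files and only the paths are retained, so each chunk buffer can be garbage-collected right after it is written. The helper name and the list-of-bytes input are illustrative, not the SDK's actual internals.

```python
import tempfile
import uuid
from pathlib import Path


def write_chunks_to_tempdir(chunks, temp_dir):
    """Write each PDF chunk's bytes to its own file and return the paths.

    Holding only paths (not bytes) lets each chunk buffer be freed as
    soon as it is flushed to disk. `chunks` is assumed to be an iterable
    of bytes objects produced by the PDF splitter.
    """
    paths = []
    for chunk in chunks:
        path = Path(temp_dir) / f"{uuid.uuid4()}.pdf"
        path.write_bytes(chunk)
        paths.append(path)
    return paths


# Usage: one TemporaryDirectory per operation, cleaned up afterwards.
tmp = tempfile.TemporaryDirectory()
paths = write_chunks_to_tempdir([b"%PDF-1.4 chunk1", b"%PDF-1.4 chunk2"], tmp.name)
```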

@pawel-kmiecik pawel-kmiecik marked this pull request as draft October 30, 2024 12:32
# If we get 200, dump the contents to a file and return the path
temp_dir = self.tempdirs[operation_id]
if response.status_code == 200:
    temp_file_name = f"{temp_dir.name}/{uuid.uuid4()}.json"
Contributor

Is there a risk that a customer using the client doesn't have permission to write files from the process they're running?

Contributor Author

Good point. But at the same time, unstructured OS uses temp files heavily. Maybe making the temp dir parametrizable via an env variable could resolve this?
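The env-variable idea could look roughly like this. The variable name `UNSTRUCTURED_CLIENT_TMPDIR` is hypothetical; `tempfile.TemporaryDirectory` falls back to the platform default when `dir=None`.

```python
import os
import tempfile

# Hypothetical env var name; the real SDK may choose a different one.
TMPDIR_ENV_VAR = "UNSTRUCTURED_CLIENT_TMPDIR"


def make_operation_tempdir(operation_id: str) -> tempfile.TemporaryDirectory:
    """Create the per-operation temp dir, honoring an env override.

    If the env var is unset or empty, dir=None makes tempfile fall back
    to the platform default (tempfile.gettempdir()).
    """
    base_dir = os.environ.get(TMPDIR_ENV_VAR) or None
    return tempfile.TemporaryDirectory(
        prefix=f"unstructured_{operation_id}_", dir=base_dir
    )


tempdir = make_operation_tempdir("op-123")
```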

Contributor

Yeah, I'd guess it's safe for our own environments, but a customer could hit an error. Maybe we could fall back to using memory if we don't have write access (not sure on the implementation there), or we could expose a path in the client that the caller sets up explicitly for temporary storage?
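The fall-back-to-memory idea could be sketched like this: attempt to open a scratch file on disk and, if the process lacks write access, return an in-memory buffer instead. The function name is illustrative; the trade-off is that the in-memory path loses the memory savings but keeps the client working.

```python
import io
import os
import tempfile
import uuid


def open_scratch_file(temp_dir: str):
    """Return a writable binary file-like object, preferring disk.

    If the process cannot write to temp_dir (e.g. a locked-down
    container), fall back to an in-memory buffer so the client still
    works, just without the memory savings.
    """
    try:
        path = os.path.join(temp_dir, f"{uuid.uuid4()}.json")
        return open(path, "w+b")
    except OSError:
        return io.BytesIO()
```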

Contributor Author

I've added a feature flag in the partition parameters (the feature is disabled by default).
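In outline, a disabled-by-default flag on the partition parameters might look like the sketch below. The field name `cache_to_disk` is hypothetical; the SDK's actual parameter name is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class PartitionParameters:
    """Illustrative subset of partition parameters.

    cache_to_disk is a hypothetical name for the tempfile-caching
    feature flag; it defaults to False so opt-in is explicit.
    """
    split_pdf_page: bool = True
    cache_to_disk: bool = False


params = PartitionParameters()
```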

@pawel-kmiecik
Contributor Author

FYI: Added more optimizations; on a 1k-page PDF the memory usage dropped from ~250MB to ~120MB (<100MB if not reading the contents of the file).
Next steps include a feature flag for the memory optimizations that use tmp-file caching (as @jordan-homan mentioned, it might be a problem for some users).

@pawel-kmiecik pawel-kmiecik marked this pull request as ready for review November 5, 2024 15:06
@jordan-homan
Contributor

before merging we should squash into one commit so that it's easy to follow in the public git history

@pawel-kmiecik
Contributor Author

pawel-kmiecik commented Nov 7, 2024

before merging we should squash into one commit so that it's easy to follow in the public git history

Isn't that the default setting? I'm not sure what the squash commit message setting in this repo is (it should be the PR description, I guess).
[Screenshot: repo merge settings, 2024-11-07 15:28]

EDIT: It looks like it takes the PR title and description, indeed.
[Screenshot: squash commit message preview, 2024-11-07 15:30]

self.coroutines_to_execute.pop(operation_id, None)
self.api_successful_responses.pop(operation_id, None)
tempdir = self.tempdirs.pop(operation_id, None)
if tempdir:
Contributor

Are there other areas we should consider cleaning up, or does _clear_operation cover every case? Curious if there are scenarios where, if the API call fails, we don't clean up and leave files in the directory.

Contributor Author

That's a limitation of the SDK... cleanup isn't supposed to fail, on the condition that all possible exceptions are caught (and it looks like we catch them, but it's not fireproof :D).
It would be much cleaner and more resilient if we could use a nice context manager, but because of the hooks we hack here, we cannot.
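Without a context manager, the closest guarantee is a `try`/`finally` around the operation so the temp dir is removed on both success and failure. The class and method names below are illustrative, not the SDK's; the point is only that `finally` runs even when the wrapped call raises.

```python
import tempfile


class OperationTempdirs:
    """Minimal sketch: guarantee temp-dir cleanup without a context
    manager, mirroring a hook-based flow where `with` isn't available.
    Names here are illustrative, not the SDK's."""

    def __init__(self):
        self.tempdirs = {}

    def run(self, operation_id, func):
        self.tempdirs[operation_id] = tempfile.TemporaryDirectory()
        try:
            return func(self.tempdirs[operation_id].name)
        finally:
            # Runs on success *and* on any exception raised by func.
            tempdir = self.tempdirs.pop(operation_id, None)
            if tempdir:
                tempdir.cleanup()


ops = OperationTempdirs()
```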

)
# force free PDF object memory
del pdf
pdf_chunks = self._get_pdf_chunk_files(pdf_chunk_paths)
Contributor

@jordan-homan jordan-homan Nov 7, 2024

Question in general about the new process: how do we improve memory management in general?

So for example, if I have a 5000-page PDF that consumes ~50MB of memory per chunk (let's say this process breaks it down into 100 batches of 50 pages), I understand that when batch 1 of 50MB completes, it will write that to a file, batch 2 writes to a file, and so on...

so if the job takes 20 minutes to complete, the overall memory usage stays low while we write to files.

However, at the end of the 20-minute time frame, once the 100th batch finally completes, we then take all 100 batches written to files and bring everything back into memory at that point to return it to the caller, right? Are we optimizing purely for the duration of the run to keep memory low, and not concerned that, at the end, memory needs to go back up to ultimately return the desired output?

Contributor Author

Yes, you're right: we don't want to occupy the memory during a long process (and with hi_res it can be tens or hundreds of minutes). But we cannot avoid loading the JSON files back into memory, because the interface must work the same way as for non-split-pdf calls, i.e. return a Response with JSON attached. That's why we won't avoid the spike at the end.

Maybe we could replace the response read operation with something smart that loads the files one by one, but I don't think we'd gain much there.

Also, what's important in these changes: we no longer pass bytes or an io.BytesIO object to the request, but a file object, which is read to the socket in a buffered way.
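The buffered-read point above can be illustrated in isolation: when an HTTP client is handed an open file object, it copies it to the socket in fixed-size buffers, so only one buffer is resident at a time instead of the whole payload. The helper below is a sketch of that copy loop; `sock_write` stands in for the socket's send function, and the 64 KiB chunk size is an assumption, not the SDK's actual value.

```python
import io

CHUNK_SIZE = 64 * 1024  # assumed buffer size, not the SDK's actual value


def stream_to_socket(fileobj, sock_write):
    """Copy fileobj to a socket-like writer in fixed-size buffers.

    This is what passing an open file object to an HTTP client buys us:
    only CHUNK_SIZE bytes are resident at a time, instead of the whole
    payload being held in memory as one bytes object.
    """
    total = 0
    while True:
        buf = fileobj.read(CHUNK_SIZE)
        if not buf:  # EOF
            return total
        sock_write(buf)
        total += len(buf)


# Usage: a 200 KB payload goes out in four buffered writes.
sent = []
n = stream_to_socket(io.BytesIO(b"x" * 200_000), sent.append)
```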

@jordan-homan
Contributor

jordan-homan commented Nov 7, 2024

Isn't it the default setting? I'm not sure what's the squash commit message setting in this repo (It should be the PR

sounds good!

@pawel-kmiecik pawel-kmiecik merged commit f7c1c94 into main Nov 7, 2024
13 checks passed
@pawel-kmiecik pawel-kmiecik deleted the pawel/fix-split-pdf-memory-usage branch November 7, 2024 15:27