[FIX] spreadsheet: batch process spreadsheet_revision.commands #284


Open
wants to merge 1 commit into master from master-batch-process-big-commands-vval

Conversation

vval-odoo

@vval-odoo vval-odoo commented Jun 11, 2025

[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have spreadsheet_revision records with over 10 million characters in commands. If the number of such records is high, this leads to memory errors here. We distribute them into buckets of at most memory_cap size and use a named cursor to process them bucket by bucket. Commands larger than memory_cap get a bucket of their own.

select count(*),SUM(CHAR_LENGTH(commands)) as characters from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%'])
+-------+------------+
| count | characters |
|-------+------------|
| 2317  | 3419017646 |
+-------+------------+
select id,CHAR_LENGTH(commands) from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%']) order by CHAR_LENGTH(commands) desc limit 10
+------+-------------+
| id   | char_length |
|------+-------------|
| 3871 | 13197157    |
| 4788 | 13197157    |
| 3290 | 13197157    |
| 3557 | 13197157    |
| 4179 | 13197157    |
| 4481 | 13197157    |
| 2757 | 13197157    |
| 3022 | 13197157    |
| 2492 | 13197157    |
| 5097 | 13197157    |
+------+-------------+
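
For context, a minimal sketch of the batching approach described above (pg.named_cursor comes from the diff; MEMORY_CAP and the query are simplified assumptions, not the exact patch):

    import json

    MEMORY_CAP = 200 * 1024 * 1024  # assumed cap, in characters; the real value may differ

    def iter_commands_sketch(cr, like_any):
        # `pg` refers to the upgrade-util helpers used in the diff (pg.named_cursor)
        with pg.named_cursor(cr, itersize=1) as ncr:
            ncr.execute(
                """
                WITH buckets AS (
                    SELECT id,
                           commands,
                           SUM(LENGTH(commands)) OVER (ORDER BY id) / %s AS num
                      FROM spreadsheet_revision
                     WHERE commands LIKE ANY(%s::text[])
                )
                SELECT ARRAY_AGG(id), ARRAY_AGG(commands)
                  FROM buckets
              GROUP BY num
                """,
                [MEMORY_CAP, list(like_any)],
            )
            # each fetched row is one bucket of ids/commands
            for ids, datas in ncr:
                for revision_id, data in zip(ids, datas):
                    yield revision_id, json.loads(data)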

Fixes upg-2899961

@robodoo
Contributor

robodoo commented Jun 11, 2025

Pull request status dashboard

@vval-odoo vval-odoo requested review from a team, jjmaksoud and aj-fuentes June 11, 2025 08:49
Contributor

@Pirols Pirols left a comment


Good work! :)



def iter_commands(cr, like_all=(), like_any=()):
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
    cr.execute(
    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
Contributor

Using a context manager you do not need to close it explicitly [1].

Suggested change
ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:

That said, this is just in the name of a more pythonic implementation. IOW: imo you can keep your current version, if you like it better.

Footnotes

  1. https://github.com/odoo/odoo//blob/bdc28e53b7c426640e30b900064442f19c946d9e/odoo/sql_db.py#L221-L238

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3752d09 to 327a6f6 Compare June 11, 2025 11:16
@KangOl
Contributor

KangOl commented Jun 11, 2025

upgradeci retry with always only *spreadsheet*

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 327a6f6 to 508732d Compare June 13, 2025 12:20
@vval-odoo vval-odoo requested a review from aj-fuentes June 20, 2025 07:56
@Pirols
Contributor

Pirols commented Jun 25, 2025

Another affected request: https://upgrade.odoo.com/odoo/action-150/2988031

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 508732d to 1bfec7f Compare June 25, 2025 13:10
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
[list(like_all or like_any)],
)
for ids, datas in ncr.fetchmany(size=1):
Contributor

This will only output the first bucket.

A simple solution is what you had already:

Suggested change
for ids, datas in ncr.fetchmany(size=1):
for ids, datas in ncr:

Combined with `with pg.named_cursor(cr, itersize=1) as ncr`.

Contributor

I agree with Edoardo. You just need to limit the itersize. Under the hood psycopg will fetch the rows one by one, and each row already contains multiple records (as many as fit in one bucket).
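
A sketch of what "limit the itersize" means in practice (bucket_query, params and process are placeholders for this illustration):

    with pg.named_cursor(cr, itersize=1) as ncr:
        ncr.execute(bucket_query, params)
        for ids, datas in ncr:   # psycopg fetches `itersize` rows per server round-trip
            process(ids, datas)  # each row is already a whole bucket of records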

cr.execute(
"UPDATE spreadsheet_revision SET commands=%s WHERE id=%s", [json.dumps(data_loaded), revision_id]

with pg.named_cursor(cr) as ncr:
Contributor

Suggested change
with pg.named_cursor(cr) as ncr:
with pg.named_cursor(cr, itersize=1) as ncr:

https://www.psycopg.org/docs/usage.html#server-side-cursors

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch 2 times, most recently from 03175d2 to 3367e6d Compare June 25, 2025 14:39
@vval-odoo
Author

vval-odoo commented Jun 25, 2025

I added ORDER BY LENGTH(commands) here

SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num

as it could reduce the number of buckets if length is randomly distributed among ids.

ARRAY_AGG(commands ORDER BY id)
FROM buckets
GROUP BY num, alone
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
Contributor

@aj-fuentes aj-fuentes Jun 25, 2025


Use format_query to set ALL or ANY. Then pass memory_cap as parameter to the query in ncr.execute
EDIT: condition = util.SQLStr("ALL" if like_all else "ANY")
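
For illustration, the suggested shape could look like this (a sketch assuming the format_query/SQLStr helpers referenced in this thread; the query body is abbreviated):

    condition = util.SQLStr("ALL" if like_all else "ANY")
    query = util.format_query(
        cr,
        """
        SELECT id, commands
          FROM spreadsheet_revision
         WHERE commands LIKE {condition}(%s::text[])
        """,
        condition=condition,
    )
    # memory_cap is then passed as a regular %s parameter instead of str.format()
    ncr.execute(query, [list(like_all or like_any)])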

"""
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
Contributor

Suggested change
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / %s AS num,

WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
LENGTH(commands) > {memory_cap} AS alone,
Contributor

Suggested change
LENGTH(commands) > {memory_cap} AS alone,
LENGTH(commands) > %s AS alone,

@aj-fuentes
Contributor

I added ORDER BY LENGTH(commands) here

SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num

as it could reduce the number of buckets if length is randomly distributed among ids.

Note that this could still lead to a big record being grouped with many more "smaller" records potentially adding up to 200mb to the size of the fetched data --vs what would be fetched if the record was alone. If we don't avoid that case I don't see any advantage in ordering by length.

@vval-odoo
Author

Note that this could still lead to a big record being grouped with many more "smaller" records potentially adding up to 200mb to the size of the fetched data --vs what would be fetched if the record was alone. If we don't avoid that case I don't see any advantage in ordering by length.

That was the idea behind my original query: compute buckets that only go above memory_cap in the case of a lone record bigger than memory_cap. But for this we need to know, for each row, the size of the following one, so it complicates the logic a bit. Should I go with this one then?

        ncr.execute(
            """
            WITH RECURSIVE start_bucket AS (
                SELECT 1 AS bucket
            ),
            ordered_rows AS (
                SELECT id,
                       commands,
                       LENGTH(commands) AS length,
                       ROW_NUMBER() OVER (ORDER BY LENGTH(commands), id) AS rownum
                  FROM spreadsheet_revision
                 WHERE commands LIKE {condition}(%s::text[])
            ),
            assign AS (
                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       sb.bucket AS bucket,
                       o.length AS sum
                  FROM ordered_rows o, start_bucket sb
                 WHERE o.rownum = 1

                UNION ALL

                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       CASE
                            WHEN a.sum + o.length > %s THEN a.bucket + 1
                            ELSE a.bucket
                       END AS bucket,
                       CASE
                            WHEN a.sum + o.length > %s THEN o.length
                            ELSE a.sum + o.length
                       END AS sum
                  FROM assign a
                  JOIN ordered_rows o
                    ON o.rownum = a.rownum + 1
            )
            SELECT ARRAY_AGG(id),
                   ARRAY_AGG(commands)
              FROM assign
          GROUP BY bucket
            """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
            [list(like_all or like_any), MEMORY_CAP, MEMORY_CAP]
        )

@aj-fuentes
Contributor

aj-fuentes commented Jun 26, 2025

That was the idea behind my original query. Compute buckets that would only go above memory_cap in case of lone record bigger than memory_cap. But for this we need to know for each row what will be the size of following one, so it complicates the logic a bit. Should I go with this one then?

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings one for big records that will be fetched individually, and another for smaller records that would be clustered by bucket.

EDIT: note that only records bigger than the bucket size need to be set alone, because the remaining records will never be fetched in a group larger than 2*MEMORY_CAP. Thus by choosing an acceptable cap (I think 200 MB is fine, as at most 400 MB would be fetched for the smaller records, while the bigger ones have no cap but are fetched alone) we are done, and we don't need any more complex logic here to decide about buckets. Here is a clearer way to do this without the "magic" alone column (pseudo SQL):

SELECT ARRAY[id], ARRAY[commands]
  FROM table
 WHERE len(commands)>cap
 UNION 
SELECT array_agg(data.id), array_agg(data.commands)
  FROM (SELECT id, commands, sum(len(commands))/cap 
          FROM table
         WHERE len(commands)<=cap
       ) AS data(id,commands,num)
 GROUP BY data.num
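
A more concrete form of that pseudo SQL, using the table and columns from this PR (a sketch; the LIKE filter and parameter handling of the real patch are omitted):

    ncr.execute(
        """
        SELECT ARRAY[id], ARRAY[commands]
          FROM spreadsheet_revision
         WHERE LENGTH(commands) > %(cap)s
        UNION ALL
        SELECT ARRAY_AGG(data.id), ARRAY_AGG(data.commands)
          FROM (SELECT id,
                       commands,
                       SUM(LENGTH(commands)) OVER (ORDER BY id) / %(cap)s AS num
                  FROM spreadsheet_revision
                 WHERE LENGTH(commands) <= %(cap)s
               ) AS data
      GROUP BY data.num
        """,
        {"cap": MEMORY_CAP},
    )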

@KangOl
Contributor

KangOl commented Jun 26, 2025

we need to know for each row what will be the size of following one

To do so, you can use the lead window function: https://www.postgresql.org/docs/13/functions-window.html

But other window functions can be used to get memory_cap-sized groups of rows.
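
For illustration, a LEAD-based query that exposes the next row's size could look like this (hypothetical; not part of the patch):

    cr.execute(
        """
        SELECT id,
               LENGTH(commands) AS length,
               LEAD(LENGTH(commands)) OVER (ORDER BY LENGTH(commands), id) AS next_length
          FROM spreadsheet_revision
        """
    )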

@KangOl
Contributor

KangOl commented Jun 26, 2025

No need for lag/lead, we can just use a cumulative sum to get the ids to fetch in successive queries

  with _groups as (
    select id, sum(length(commands)) over (ORDER BY id) / {mem_cap} as cs
      from spreadsheet_revision
     where {condition}
  )
  select array_agg(id)
    from _groups
group by cs
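
A sketch of how that grouping could drive successive queries (patterns and MEMORY_CAP are assumed inputs; only the ids are kept in memory up front):

    cr.execute(
        """
        WITH _groups AS (
            SELECT id, SUM(LENGTH(commands)) OVER (ORDER BY id) / %s AS cs
              FROM spreadsheet_revision
             WHERE commands LIKE ANY(%s::text[])
        )
        SELECT ARRAY_AGG(id)
          FROM _groups
      GROUP BY cs
        """,
        [MEMORY_CAP, patterns],
    )
    groups = [ids for (ids,) in cr.fetchall()]
    for ids in groups:
        cr.execute("SELECT id, commands FROM spreadsheet_revision WHERE id = ANY(%s)", [ids])
        for revision_id, commands in cr.fetchall():
            ...  # process one bucket's worth of commands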

@Pirols
Contributor

Pirols commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.

I don't think that the alone trick is enough, it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c0) is 6 chars long, the second one (c1) is 6 chars long, and the third one (c2) is 5 chars long. None qualifies as alone. c0 goes into b0 (first bucket), c1 goes into b1 (because 12 / 10 > 1), c2 also goes into b1 (17 / 10 < 2), however len(c1) + len(c2) = 11 > 10 = bucket_size:

|----b0----|----b1----|----b2----|
 **c0**
            **c1**
                  *c2**

@vval-odoo
Author

vval-odoo commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings one for big records that will be fetched individually, and another for smaller records that would be clustered by bucket.

I think that two records bigger than memory_cap can't get in the same bucket, even without the "alone" trick, because of the division. What this trick prevents is smaller records "completing" a bucket that is already too large, but that doesn't solve any problem, because a bucket of 15.something * memory_cap would make the program crash just as a bucket of 15 * memory_cap would. What I try to prevent is getting buckets between memory_cap and 2 * memory_cap.
I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

@aj-fuentes
Contributor

aj-fuentes commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.

I don't think that the alone trick is enough, it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c1) is 6 chars long, second one (c2) is 6 chars long, third one (c3) is 5 chars long. No one qualifies as alone. First command goes into b0 (first bucket), second command goes into b1 (because 12 / 10 > 1), third command also goes into b1 (17/10 < 2), however len(c2) + len(c3) = 11 > bucket_size = 10:

|----b0----|----b1----|----b2----|
 **c1**
            **c2**
                  *c3**

I think it is correct. See, we don't care about a 2*max_cap size since we know that's the hard max. We care about a really big record, think 5*max_cap, pulling in even more smaller ones. In your example, yes, c2+c3 > max_cap, but still < 2*max_cap. That's my point. The logic is simpler, and as shown by the example here, the biggest group exceeds max_cap but stays under 2*max_cap, which is still acceptable for memory.

@aj-fuentes
Contributor

I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

@vval-odoo yes. Which is fine.

Contributor

@Pirols Pirols left a comment


Remove seemingly useless ordering?

"""
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
Contributor

Suggested change
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands)) / {memory_cap} AS num,

Contributor

The ordering is necessary. An example, just to illustrate that we really don't want the cumulative sum to run over a random order:

=> select id,sum(id) over(order by id) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
| 1  | 1   |
| 2  | 3   |
| 3  | 6   |
| 4  | 10  |
+----+-----+
=> select id,sum(id) over(order by id desc) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
| 1  | 10  |
| 2  | 9   |
| 3  | 7   |
| 4  | 4   |
+----+-----+

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3367e6d to 3e92404 Compare June 26, 2025 13:39
@KangOl
Contributor

KangOl commented Jun 26, 2025

upgradeci retry

1 similar comment
@KangOl
Contributor

KangOl commented Jun 26, 2025

upgradeci retry

Contributor

@aj-fuentes aj-fuentes left a comment


Use format_query.

ARRAY[commands]
FROM filtered
WHERE commands_length > %s
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
Contributor

Suggested change
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
""",
condition=pg.SQLStr("ALL" if like_all else "ANY"),
) # close format_query


with pg.named_cursor(cr, itersize=1) as ncr:
ncr.execute(
"""
Contributor

Use format_query. The query is becoming increasingly complex; better to use the right formatting tool to avoid issues later.

Suggested change
"""
util.format_query(
cr,
"""
