[FIX] spreadsheet: batch process spreadsheet_revision.commands #284


Open
wants to merge 1 commit into master from master-batch-process-big-commands-vval

Conversation

vval-odoo

@vval-odoo vval-odoo commented Jun 11, 2025

[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have spreadsheet_revision records with over 10 million characters in commands. If the number of such records is high, this leads to memory errors here. We distribute them into buckets of at most memory_cap size and use a named cursor to process them bucket by bucket. Commands larger than memory_cap get a bucket of their own.

select count(*),SUM(CHAR_LENGTH(commands)) as characters from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%'])
+-------+------------+
| count | characters |
|-------+------------|
| 2317  | 3419017646 |
+-------+------------+
select id,CHAR_LENGTH(commands) from spreadsheet_revision where commands like any (array['%"DELETE_SHEET"%', '%"MOVE_COLUMNS_ROWS"%', '%"REMOVE_COLUMNS_ROWS"%', '%"ADD_COLUMNS_ROWS"%', '%"RENAME_SHEET"%']) order by CHAR_LENGTH(commands) desc limit 10
+------+-------------+
| id   | char_length |
|------+-------------|
| 3871 | 13197157    |
| 4788 | 13197157    |
| 3290 | 13197157    |
| 3557 | 13197157    |
| 4179 | 13197157    |
| 4481 | 13197157    |
| 2757 | 13197157    |
| 3022 | 13197157    |
| 2492 | 13197157    |
| 5097 | 13197157    |
+------+-------------+
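
For context, a minimal sketch of the batching approach described above (pg.named_cursor comes from the diff; MEMORY_CAP and the query are simplified assumptions, not the exact patch):

    import json

    MEMORY_CAP = 200 * 1024 * 1024  # assumed cap, in characters; the real value may differ

    def iter_commands_sketch(cr, like_any):
        # `pg` refers to the upgrade-util helpers used in the diff (pg.named_cursor)
        with pg.named_cursor(cr, itersize=1) as ncr:
            ncr.execute(
                """
                WITH buckets AS (
                    SELECT id,
                           commands,
                           SUM(LENGTH(commands)) OVER (ORDER BY id) / %s AS num
                      FROM spreadsheet_revision
                     WHERE commands LIKE ANY(%s::text[])
                )
                SELECT ARRAY_AGG(id), ARRAY_AGG(commands)
                  FROM buckets
              GROUP BY num
                """,
                [MEMORY_CAP, list(like_any)],
            )
            # each fetched row is one bucket of ids/commands
            for ids, datas in ncr:
                for revision_id, data in zip(ids, datas):
                    yield revision_id, json.loads(data)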

Fixes upg-2899961

@robodoo
Contributor

robodoo commented Jun 11, 2025

Pull request status dashboard

@vval-odoo vval-odoo requested review from a team, jjmaksoud and aj-fuentes June 11, 2025 08:49
Contributor

@Pirols Pirols left a comment


Good work! :)



def iter_commands(cr, like_all=(), like_any=()):
    if not (bool(like_all) ^ bool(like_any)):
        raise ValueError("Please specify `like_all` or `like_any`, not both")
    cr.execute(
    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
Contributor

Using a context manager you do not need to close it explicitly [1].

Suggested change
ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:

That said, this is just in the name of a more pythonic implementation. IOW: imo you can keep your current version, if you like it better.

Footnotes

  1. https://github.com/odoo/odoo//blob/bdc28e53b7c426640e30b900064442f19c946d9e/odoo/sql_db.py#L221-L238

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3752d09 to 327a6f6 Compare June 11, 2025 11:16
@KangOl
Contributor

KangOl commented Jun 11, 2025

upgradeci retry with always only *spreadsheet*

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 327a6f6 to 508732d Compare June 13, 2025 12:20
@vval-odoo vval-odoo requested a review from aj-fuentes June 20, 2025 07:56
@Pirols
Contributor

Pirols commented Jun 25, 2025

Another affected request: https://upgrade.odoo.com/odoo/action-150/2988031

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 508732d to 1bfec7f Compare June 25, 2025 13:10
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
[list(like_all or like_any)],
)
for ids, datas in ncr.fetchmany(size=1):
Contributor

This will only output the first bucket.

A simple solution is what you had already:

Suggested change
for ids, datas in ncr.fetchmany(size=1):
for ids, datas in ncr:

Combined with `with pg.named_cursor(cr, itersize=1) as ncr`.

Contributor

I agree with Edoardo. You just need to limit the itersize. Under the hood psycopg will fetch the rows one by one, and each row already contains multiple records (as many as fit in one bucket).
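
A sketch of what "limit the itersize" means in practice (bucket_query, params and process are placeholders for this illustration):

    with pg.named_cursor(cr, itersize=1) as ncr:
        ncr.execute(bucket_query, params)
        for ids, datas in ncr:   # psycopg fetches `itersize` rows per server round-trip
            process(ids, datas)  # each row is already a whole bucket of records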

cr.execute(
"UPDATE spreadsheet_revision SET commands=%s WHERE id=%s", [json.dumps(data_loaded), revision_id]

with pg.named_cursor(cr) as ncr:
Contributor

Suggested change
with pg.named_cursor(cr) as ncr:
with pg.named_cursor(cr, itersize=1) as ncr:

https://www.psycopg.org/docs/usage.html#server-side-cursors

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch 2 times, most recently from 03175d2 to 3367e6d Compare June 25, 2025 14:39
@vval-odoo
Author

vval-odoo commented Jun 25, 2025

I added ORDER BY LENGTH(commands) here

SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num

as it could reduce the number of buckets if length is randomly distributed among ids.

ARRAY_AGG(commands ORDER BY id)
FROM buckets
GROUP BY num, alone
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
Contributor

@aj-fuentes aj-fuentes Jun 25, 2025


Use format_query to set ALL or ANY. Then pass memory_cap as parameter to the query in ncr.execute
EDIT: condition = util.SQLStr("ALL" if like_all else "ANY")
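
For illustration, the suggested shape could look like this (a sketch assuming the format_query/SQLStr helpers referenced in this thread; the query body is abbreviated):

    condition = util.SQLStr("ALL" if like_all else "ANY")
    query = util.format_query(
        cr,
        """
        SELECT id, commands
          FROM spreadsheet_revision
         WHERE commands LIKE {condition}(%s::text[])
        """,
        condition=condition,
    )
    # memory_cap is then passed as a regular %s parameter instead of str.format()
    ncr.execute(query, [list(like_all or like_any)])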

"""
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
Contributor

Suggested change
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / %s AS num,

WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
LENGTH(commands) > {memory_cap} AS alone,
Contributor

Suggested change
LENGTH(commands) > {memory_cap} AS alone,
LENGTH(commands) > %s AS alone,

@aj-fuentes
Contributor

I added ORDER BY LENGTH(commands) here

SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num

as it could reduce the number of buckets if length is randomly distributed among ids.

Note that this could still lead to a big record being grouped with many more "smaller" records potentially adding up to 200mb to the size of the fetched data --vs what would be fetched if the record was alone. If we don't avoid that case I don't see any advantage in ordering by length.

@vval-odoo
Author

Note that this could still lead to a big record being grouped with many more "smaller" records potentially adding up to 200mb to the size of the fetched data --vs what would be fetched if the record was alone. If we don't avoid that case I don't see any advantage in ordering by length.

That was the idea behind my original query: compute buckets that only go above memory_cap in the case of a lone record bigger than memory_cap. But for this we need to know, for each row, the size of the following one, so it complicates the logic a bit. Should I go with this one then?

        ncr.execute(
            """
            WITH RECURSIVE start_bucket AS (
                SELECT 1 AS bucket
            ),
            ordered_rows AS (
                SELECT id,
                       commands,
                       LENGTH(commands) AS length,
                       ROW_NUMBER() OVER (ORDER BY LENGTH(commands), id) AS rownum
                  FROM spreadsheet_revision
                 WHERE commands LIKE {condition}(%s::text[])
            ),
            assign AS (
                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       sb.bucket AS bucket,
                       o.length AS sum
                  FROM ordered_rows o, start_bucket sb
                 WHERE o.rownum = 1

                UNION ALL

                SELECT o.id AS id,
                       o.commands as commands,
                       o.length,
                       o.rownum,
                       CASE
                            WHEN a.sum + o.length > %s THEN a.bucket + 1
                            ELSE a.bucket
                       END AS bucket,
                       CASE
                            WHEN a.sum + o.length > %s THEN o.length
                            ELSE a.sum + o.length
                       END AS sum
                  FROM assign a
                  JOIN ordered_rows o
                    ON o.rownum = a.rownum + 1
            )
            SELECT ARRAY_AGG(id),
                   ARRAY_AGG(commands)
              FROM assign
          GROUP BY bucket
            """.format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
            [list(like_all or like_any), MEMORY_CAP, MEMORY_CAP]
        )

@aj-fuentes
Contributor

aj-fuentes commented Jun 26, 2025

That was the idea behind my original query. Compute buckets that would only go above memory_cap in case of lone record bigger than memory_cap. But for this we need to know for each row what will be the size of following one, so it complicates the logic a bit. Should I go with this one then?

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings one for big records that will be fetched individually, and another for smaller records that would be clustered by bucket.

EDIT: note that only records bigger than the bucket size need to be set alone, because the remaining records will never be fetched in a group larger than 2*MEMORY_CAP. Thus by choosing an acceptable cap (I think 200 MB is fine, as at most 400 MB would be fetched for the smaller records, while the bigger ones have no cap but are fetched alone) we are done, and we don't need any more complex logic here to decide about buckets. Here is a clearer way to do this without the "magic" alone column (pseudo SQL):

SELECT ARRAY[id], ARRAY[commands]
  FROM table
 WHERE len(commands)>cap
 UNION 
SELECT array_agg(data.id), array_agg(data.commands)
  FROM (SELECT id, commands, sum(len(commands))/cap 
          FROM table
         WHERE len(commands)<=cap
       ) AS data(id,commands,num)
 GROUP BY data.num
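
A more concrete form of that pseudo SQL, using the table and columns from this PR (a sketch; the LIKE filter and parameter handling of the real patch are omitted):

    ncr.execute(
        """
        SELECT ARRAY[id], ARRAY[commands]
          FROM spreadsheet_revision
         WHERE LENGTH(commands) > %(cap)s
        UNION ALL
        SELECT ARRAY_AGG(data.id), ARRAY_AGG(data.commands)
          FROM (SELECT id,
                       commands,
                       SUM(LENGTH(commands)) OVER (ORDER BY id) / %(cap)s AS num
                  FROM spreadsheet_revision
                 WHERE LENGTH(commands) <= %(cap)s
               ) AS data
      GROUP BY data.num
        """,
        {"cap": MEMORY_CAP},
    )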

@KangOl
Contributor

KangOl commented Jun 26, 2025

we need to know for each row what will be the size of following one

To do so, you can use the lead window function: https://www.postgresql.org/docs/13/functions-window.html

But other window functions can be used to get memory_cap-sized groups of rows.
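
For illustration, a LEAD-based query that exposes the next row's size could look like this (hypothetical; not part of the patch):

    cr.execute(
        """
        SELECT id,
               LENGTH(commands) AS length,
               LEAD(LENGTH(commands)) OVER (ORDER BY LENGTH(commands), id) AS next_length
          FROM spreadsheet_revision
        """
    )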

@KangOl
Contributor

KangOl commented Jun 26, 2025

No need for lag/lead, we can just use a cumulative sum to get the ids to fetch in successive queries

  with _groups as (
    select id, sum(length(commands)) over (ORDER BY id) / {mem_cap} as cs
      from spreadsheet_revision
     where {condition}
  )
  select array_agg(id)
    from _groups
group by cs
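
A sketch of how that grouping could drive successive queries (patterns and MEMORY_CAP are assumed inputs; only the ids are kept in memory up front):

    cr.execute(
        """
        WITH _groups AS (
            SELECT id, SUM(LENGTH(commands)) OVER (ORDER BY id) / %s AS cs
              FROM spreadsheet_revision
             WHERE commands LIKE ANY(%s::text[])
        )
        SELECT ARRAY_AGG(id)
          FROM _groups
      GROUP BY cs
        """,
        [MEMORY_CAP, patterns],
    )
    groups = [ids for (ids,) in cr.fetchall()]
    for ids in groups:
        cr.execute("SELECT id, commands FROM spreadsheet_revision WHERE id = ANY(%s)", [ids])
        for revision_id, commands in cr.fetchall():
            ...  # process one bucket's worth of commands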

@Pirols
Contributor

Pirols commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.

I don't think that the alone trick is enough, it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c0) is 6 chars long, the second one (c1) is 6 chars long, and the third one (c2) is 5 chars long. None qualifies as alone. c0 goes into b0 (first bucket), c1 goes into b1 (because 12 / 10 > 1), c2 also goes into b1 (17 / 10 < 2), however len(c1) + len(c2) = 11 > 10 = bucket_size:

|----b0----|----b1----|----b2----|
 **c0**
            **c1**
                  *c2**

@vval-odoo
Author

vval-odoo commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket. Therefore if we group by num, alone we achieve our goal: big records won't be bucketed with smaller ones. You can think of it as having two groupings one for big records that will be fetched individually, and another for smaller records that would be clustered by bucket.

I think that two records bigger than memory_cap can't get in the same bucket, even without the "alone" trick, because of the division. What this trick prevents is smaller records "completing" a bucket that is already too large, but that doesn't solve any problem, because a bucket of 15.something * memory_cap would make the program crash just as a bucket of 15 * memory_cap would. What I try to prevent is getting buckets between memory_cap and 2 * memory_cap.
I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

@aj-fuentes
Contributor

aj-fuentes commented Jun 26, 2025

What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get alone=True, now it's impossible for two records bigger than a bucket to get alone=True in the same bucket.

I don't think that the alone trick is enough, it still does not reason about wasted bucket space. Imagine the following scenario: bucket size is 10 chars, the first command (c1) is 6 chars long, second one (c2) is 6 chars long, third one (c3) is 5 chars long. No one qualifies as alone. First command goes into b0 (first bucket), second command goes into b1 (because 12 / 10 > 1), third command also goes into b1 (17/10 < 2), however len(c2) + len(c3) = 11 > bucket_size = 10:

|----b0----|----b1----|----b2----|
 **c1**
            **c2**
                  *c3**

I think it is correct. See, we don't care about a 2*max_cap size since we know that's the hard max. We care about a really big record, think 5*max_cap, pulling in even more smaller ones. In your example, yes, c2+c3 > max_cap, but still < 2*max_cap. That's my point. The logic is simpler, and as shown by the example here, the biggest group exceeds max_cap but stays under 2*max_cap, which is still acceptable for memory.

@aj-fuentes
Contributor

I'll try what you proposed in your 'edit', but we have to agree that the real memory cap is 2 * memory_cap, right?

@vval-odoo yes. Which is fine.

Contributor

@Pirols Pirols left a comment


Remove seemingly useless ordering?

"""
WITH buckets AS (
SELECT id,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
Contributor

Suggested change
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands)) / {memory_cap} AS num,

Contributor

The ordering is necessary. An example, just to illustrate that we really don't want the cumulative sum to run over a random order:

=> select id,sum(id) over(order by id) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
| 1  | 1   |
| 2  | 3   |
| 3  | 6   |
| 4  | 10  |
+----+-----+
=> select id,sum(id) over(order by id desc) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
| 1  | 10  |
| 2  | 9   |
| 3  | 7   |
| 4  | 4   |
+----+-----+

@vval-odoo vval-odoo force-pushed the master-batch-process-big-commands-vval branch from 3367e6d to 3e92404 Compare June 26, 2025 13:39
@KangOl
Contributor

KangOl commented Jun 26, 2025

upgradeci retry

1 similar comment
@KangOl
Contributor

KangOl commented Jun 26, 2025

upgradeci retry

Contributor

@aj-fuentes aj-fuentes left a comment


Use format_query.

ARRAY[commands]
FROM filtered
WHERE commands_length > %s
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
Contributor

Suggested change
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
""",
condition=pg.SQLStr("ALL" if like_all else "ANY"),
) # close format_query


with pg.named_cursor(cr, itersize=1) as ncr:
ncr.execute(
"""
Contributor

Use format_query. The query is becoming increasingly complex; better to use the right formatting tool to avoid issues later.

Suggested change
"""
util.format_query(
cr,
"""
