[FIX] spreadsheet: batch process spreadsheet_revision.commands
#284
base: master
Conversation
Good work! :)
src/util/spreadsheet/misc.py
Outdated
```diff
 def iter_commands(cr, like_all=(), like_any=()):
     if not (bool(like_all) ^ bool(like_any)):
         raise ValueError("Please specify `like_all` or `like_any`, not both")
-    cr.execute(
+    ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
```
Using a context manager you do not need to close it explicitly.

```diff
-ncr = pg.named_cursor(cr, itersize=BATCH_SIZE)
+with pg.named_cursor(cr, itersize=BATCH_SIZE) as ncr:
```
That said, this is just in the name of a more pythonic implementation. IOW: imo you can keep your current version, if you like it better.
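For readers unfamiliar with the pattern, here is a minimal sketch of why the context-manager form is safer, using a toy cursor object (a hypothetical stand-in, not psycopg's real named cursor):

```python
from contextlib import closing


class FakeCursor:
    """Toy stand-in for a psycopg named (server-side) cursor."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# Explicit close: easy to forget, and skipped entirely if an
# exception is raised before reaching it (unless wrapped in try/finally).
cur = FakeCursor()
try:
    pass  # ... fetch and process rows ...
finally:
    cur.close()

# Context manager: close() runs automatically, even on error.
with closing(FakeCursor()) as cur2:
    pass  # ... fetch and process rows ...
assert cur2.closed
```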
Force-pushed from `3752d09` to `327a6f6`.
Force-pushed from `327a6f6` to `508732d`.
Another affected request: https://upgrade.odoo.com/odoo/action-150/2988031
Force-pushed from `508732d` to `1bfec7f`.
src/util/spreadsheet/misc.py
Outdated
```python
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
    [list(like_all or like_any)],
)
for ids, datas in ncr.fetchmany(size=1):
```
This will only output the first bucket.
A simple solution is what you had already:
```diff
-for ids, datas in ncr.fetchmany(size=1):
+for ids, datas in ncr:
```
With `with pg.named_cursor(cr, itersize=1) as ncr`.
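To illustrate the point being made, a toy stand-in for a named cursor (hypothetical, not psycopg's actual implementation) shows that a single `fetchmany(size=1)` call only yields the first row, while iterating the cursor drains everything:

```python
class FakeNamedCursor:
    """Toy cursor: iterating yields every row; fetchmany(size) yields one batch."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size=1):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

    def __iter__(self):
        while True:
            batch = self.fetchmany(16)
            if not batch:
                return
            yield from batch


rows = [([1], ["a"]), ([2], ["b"]), ([3], ["c"])]

# `for ... in ncr.fetchmany(size=1)` processes only the first bucket:
first_only = list(FakeNamedCursor(rows).fetchmany(size=1))

# iterating the cursor itself processes all buckets:
all_rows = list(FakeNamedCursor(rows))
```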
I agree with Edoardo. You just need to limit the itersize. Under the hood psycopg will fetch the rows one by one, and each row already contains multiple records (as many as fit in one bucket).
src/util/spreadsheet/misc.py
Outdated
```python
cr.execute(
    "UPDATE spreadsheet_revision SET commands=%s WHERE id=%s", [json.dumps(data_loaded), revision_id]
)
with pg.named_cursor(cr) as ncr:
```
```diff
-with pg.named_cursor(cr) as ncr:
+with pg.named_cursor(cr, itersize=1) as ncr:
```
Force-pushed from `03175d2` to `3367e6d`.
I added … as it could reduce the number of buckets if length is randomly distributed among ids.
src/util/spreadsheet/misc.py
Outdated
```python
       ARRAY_AGG(commands ORDER BY id)
  FROM buckets
 GROUP BY num, alone
""".format(memory_cap=MEMORY_CAP, condition="ALL" if like_all else "ANY"),
```
Use `format_query` to set `ALL` or `ANY`. Then pass `memory_cap` as a parameter to the query in `ncr.execute`.

EDIT: `condition = util.SQLStr("ALL" if like_all else "ANY")`
src/util/spreadsheet/misc.py
Outdated
```python
"""
WITH buckets AS (
    SELECT id,
           SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
```
```diff
-SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
+SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / %s AS num,
```
src/util/spreadsheet/misc.py
Outdated
```python
WITH buckets AS (
    SELECT id,
           SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
           LENGTH(commands) > {memory_cap} AS alone,
```
```diff
-LENGTH(commands) > {memory_cap} AS alone,
+LENGTH(commands) > %s AS alone,
```
Note that this could still lead to a big record being grouped with many more "smaller" records, potentially adding up to 200 MB to the size of the fetched data vs what would be fetched if the record were alone. If we don't avoid that case I don't see any advantage in ordering by length.
That was the idea behind my original query. Compute buckets that would only go above …
What's wrong with using the "alone" trick I sent you? If a record is bigger than the bucket it will get …

EDIT: note that only records bigger than the bucket size need to be set alone, because all remaining records will never be fetched in a group of more than 2*MEMORY_CAP size. Thus by choosing an acceptable cap (I think 200 MB is fine: at most 400 MB would be fetched for smaller records, while bigger ones won't have any cap but are fetched alone) we are done, and we don't need any more complex logic here to decide about buckets. Here is a clearer way to do this without the "magic" alone column (pseudo SQL):

```sql
SELECT ARRAY[id], ARRAY[commands]
  FROM table
 WHERE len(commands) > cap
UNION
SELECT array_agg(data.id), array_agg(data.commands)
  FROM (SELECT id, commands, sum(len(commands))/cap
          FROM table
         WHERE len(commands) <= cap
       ) AS data(id, commands, num)
 GROUP BY data.num
```
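As a sketch of the bucketing idea under discussion (hypothetical sample data and a deliberately tiny cap, not the real query or real sizes):

```python
# Toy cap for illustration; the real discussion uses ~200 MB.
MEMORY_CAP = 10

# Hypothetical (id, commands) pairs.
records = [(1, "aaa"), (2, "bbbb"), (3, "x" * 25), (4, "ccccc")]

# Oversized records get a bucket of their own, with no cap applied.
big_buckets = [([rid], [cmd]) for rid, cmd in records if len(cmd) > MEMORY_CAP]

# Remaining records are grouped by running-sum // cap -- the Python
# analogue of SUM(LENGTH(commands)) OVER (...) / cap in the SQL above.
small_buckets = {}
running = 0
for rid, cmd in records:
    if len(cmd) > MEMORY_CAP:
        continue
    running += len(cmd)
    ids, cmds = small_buckets.setdefault(running // MEMORY_CAP, ([], []))
    ids.append(rid)
    cmds.append(cmd)

# Key property: each small bucket stays under 2 * cap, since a bucket can
# only overshoot the cap by the size of its last (cap-sized at most) record.
assert all(
    sum(map(len, cmds)) < 2 * MEMORY_CAP for _, cmds in small_buckets.values()
)
```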
To do so, you can use the … But other window functions can be used to get …
No need for … :

```sql
with _groups as (
    select id, sum(length(commands)) over (ORDER BY id) / {mem_cap} as cs
      from spreadsheet_revision
     where {condition}
)
select array_agg(id)
  from _groups
 group by cs
```
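The running-sum grouping in that query can be mimicked in plain Python (hypothetical lengths and a shrunken cap, for illustration only):

```python
# Hypothetical LENGTH(commands) per id, and a toy cap.
lengths = {1: 40, 2: 70, 3: 10, 4: 90}
MEM_CAP = 100

groups = {}
cum = 0
for rid in sorted(lengths):          # ORDER BY id
    cum += lengths[rid]              # sum(length(commands)) over (order by id)
    groups.setdefault(cum // MEM_CAP, []).append(rid)  # group by cs
```

Running sums are 40, 110, 120, 210, so ids 2 and 3 share a group while 1 and 4 each land in their own.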
I don't think that the …

I think that two records bigger than …

I think it is correct. See, we don't care about …

@vval-odoo yes. Which is fine.
Remove seemingly useless ordering?
src/util/spreadsheet/misc.py
Outdated
```python
"""
WITH buckets AS (
    SELECT id,
           SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
```
```diff
-SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands),id) / {memory_cap} AS num,
+SUM(LENGTH(commands)) OVER (ORDER BY LENGTH(commands)) / {memory_cap} AS num,
```
The order is necessary. Example just to illustrate that we don't really want the sum to go over a random order:

```
=> select id, sum(id) over(order by id) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
|  1 |   1 |
|  2 |   3 |
|  3 |   6 |
|  4 |  10 |
+----+-----+

=> select id, sum(id) over(order by id desc) from res_users order by id
+----+-----+
| id | sum |
|----+-----|
|  1 |  10 |
|  2 |   9 |
|  3 |   7 |
|  4 |   4 |
+----+-----+
```
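The same point in plain Python: the window's ORDER BY determines which running-sum value each row receives, so the sum must not run over an arbitrary order.

```python
ids = [1, 2, 3, 4]


def running_sum(vals):
    """Cumulative sum, like SUM(...) OVER (ORDER BY ...)."""
    out, total = [], 0
    for v in vals:
        total += v
        out.append(total)
    return out


# over (order by id): each row sums everything up to itself, ascending.
asc = running_sum(ids)

# over (order by id desc), then read back in ascending id order,
# reproducing the second psql table above.
desc = dict(zip(reversed(ids), running_sum(ids[::-1])))
by_id = [desc[i] for i in ids]
```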
Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors. We distribute them into buckets of at most `memory_cap` size, and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` are placed alone in their own bucket.
Force-pushed from `3367e6d` to `3e92404`.
upgradeci retry

1 similar comment

upgradeci retry
Use format_query.
```python
       ARRAY[commands]
  FROM filtered
 WHERE commands_length > %s
""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
```
```diff
-""".format(condition=pg.SQLStr("ALL" if like_all else "ANY")),
+""",
+    condition=pg.SQLStr("ALL" if like_all else "ANY"),
+)  # close format_query
```
```python
with pg.named_cursor(cr, itersize=1) as ncr:
    ncr.execute(
        """
```
Use `format_query`. The query is becoming increasingly complex, better use the right formatting tool to avoid issues later.

```diff
-        """
+        util.format_query(
+            cr,
+            """
```
[FIX] spreadsheet: batch process spreadsheet_revision.commands

Some dbs have `spreadsheet_revision` records with over 10 million characters in `commands`. If the number of records is high, this leads to memory errors here. We distribute them into buckets of at most `memory_cap` size, and use a named cursor to process them bucket by bucket. Commands larger than `memory_cap` are placed alone in their own bucket.

Fixes upg-2899961