Skip to content

Speed bottleneck on deepcopy #152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MicaelJarniac opened this issue May 30, 2022 · 11 comments · Fixed by #277
Closed

Speed bottleneck on deepcopy #152

MicaelJarniac opened this issue May 30, 2022 · 11 comments · Fixed by #277
Assignees
Labels
bug Something isn't working

Comments

@MicaelJarniac
Copy link

We're having issues with extremely slow queries while using the object mapper. The raw query itself isn't slow, but instancing the objects in Python seems to be.

While profiling, we noticed that there seems to be a bottleneck on the following line, using deepcopy:

copied_value = deepcopy(value)

We're not entirely sure what that deepcopy is for, so we tried removing it locally, and it seemed to result in a huge speed boost.

I believe simply removing it might break something, so I won't suggest doing that straight away, but I believe there might be a better approach to this that could avoid deepcopy when not necessary.

@ultrabug
Copy link
Collaborator

FYI @jsurloppe @dargor @wyfo since we're into Python Scylla perf analysis: maybe we're affected somehow too and investigating could help?

@MicaelJarniac please be kindly reminded that we maintain a fork of the official cassandra driver here so we don't have all the background and history on every design and legacy decision

The plan of ScyllaDB is to switch out from this fork someday and use the scylla rust driver as a sane base for other language drivers with bindings.

@fruch
Copy link

fruch commented Jun 26, 2022

I think this can easily be replaced by dict comprehension

        copied_value = {name:  field.to_python(value[name]) if value is not None or isinstance(field, BaseContainerColumn) else value[name] for name, field in self.user_type._fields.items() }
        return copied_value

@fruch fruch added the bug Something isn't working label Jun 26, 2022
@ultrabug
Copy link
Collaborator

@MicaelJarniac did you try what @fruch is proposing by any chance?

@wyfo
Copy link

wyfo commented Jul 16, 2022

I'm just wondering why is there or instead of and in

if copied_value[name] is not None or isinstance(field, BaseContainerColumn)

I mean, to_python seems to be a method of BaseContainerColumn, it's illogical to call it if only the first part of the test is true.

I also think there is a mistake in @fruch code because if value is not None should be if value[name] is not None.

I don't know this driver, so I don't know if there is some hidden implications for deepcopy removal; but it seems to me that it would be ok.
However, performance-wise, I would not use a dict-comprehension, but just replace deepcopy by a call to dict.copy.

$ python -m timeit -s "deepcopy = __import__(\"copy\").deepcopy; N = 10; value = dict(zip(range(N), range(N)))" "copied
 = deepcopy(value)" "for i in range(N):" "    if (val := value[i]) >= 5:" "        copied[i] = val + 1"
20000 loops, best of 5: 12.8 usec per loop
$ python -m timeit -s "N = 10; value = dict(zip(range(N), range(N)))" "{i: value[i] if value[i] < 5 else value[i] + 1 for i in range(N)}"
200000 loops, best of 5: 1.8 usec per loop
$ python -m timeit -s "N = 10; value = dict(zip(range(N), range(N)))" "copied = value.copy()" "for i in range(N):" "     if value[i] >= 5:" "        copied[i] = copied[i] + 1"
200000 loops, best of 5: 1.28 usec per loop

@fruch
Copy link

fruch commented Jul 17, 2022

@MicaelJarniac @wyfo, since this part of the code isn't scylla specific,
if this change is that much impact, I would suggest taking it upstream to https://github.com/datastax/python-driver (so the rest of the community would gain from it)

@MicaelJarniac
Copy link
Author

@MicaelJarniac @wyfo, since this part of the code isn't scylla specific, if this change is that much impact, I would suggest taking it upstream to https://github.com/datastax/python-driver (so the rest of the community would gain from it)

I was originally going to open this issue there, but they don't use GitHub Issues; they want us to create an account on another tracker or whatever, so I didn't bother to go through their bureaucracy.

But I agree, it'd make more sense for this to go there instead.

@MicaelJarniac did you try what @fruch is proposing by any chance?

I haven't tested it yet, sadly.

@MicaelJarniac
Copy link
Author

https://datastax-oss.atlassian.net/browse/PYTHON-1309

@fruch
Copy link

fruch commented Jul 19, 2022

@MicaelJarniac if you'll get a cold shoulder there, we could try applying those fixes here, but I'll first would want to enable the cqlengine integration tests, we are not really running them under this fork.

also if you guys some benchmark test you are using, it would be nice if you could share them (or even contribute them as tests)

@k0machi
Copy link

k0machi commented Nov 3, 2022

I've also encountered this issue with a Model that contains a UDT that contains a list of nested UDTs, this deepcopy call extends the query well into dozens of seconds in such a scenario. I see that upstream didn't take a look at the issue at all so maybe we should fix this issue on our end.

@k0machi
Copy link

k0machi commented Nov 30, 2023

After running cqlengine tests I have not found any regressions when completely removing the deepcopy in to_python. Removing deepcopy from to_database causes several tests to fail (as the source object now mutates when it is being serialized for the database.

I believe the rationale behind those deepcopies is to protect source object from modifications during db operations, however in to_python case I do not see a reason why it needs to be copied after being deserialized, so I think we may be able to remove it with no issues. In my tests I have found an almost 12x speedup when deserializing complex tables.

@k0machi k0machi self-assigned this Nov 30, 2023
@fruch
Copy link

fruch commented Nov 30, 2023

After running cqlengine tests I have not found any regressions when completely removing the deepcopy in to_python. Removing deepcopy from to_database causes several tests to fail (as the source object now mutates when it is being serialized for the database.

I believe the rationale behind those deepcopies is to protect source object from modifications during db operations, however in to_python case I do not see a reason why it needs to be copied after being deserialized, so I think we may be able to remove it with no issues. In my tests I have found an almost 12x speedup when deserializing complex tables.

Do a PR with this, and we'll put it on the next release

k0machi added a commit to k0machi/python-driver that referenced this issue Dec 1, 2023
This change makes it so newly instanced UserType during deserialization
isn't immediately copied by deepcopy, which could cause huge slowdown if
that UserType contains a lot of data or nested UserTypes, in which case the
deepcopy calls would cascade as each to_python call would eventually clone
parts of source object. As there isn't a lot of information on why this
deepcopy is here in the first place this change could potentially break
something. Running integration tests against this commit does not produce
regressions, so this call looks safe to remove, but I'm leaving this
warning here for the future reference.

Fixes scylladb#152
k0machi added a commit to k0machi/python-driver that referenced this issue Dec 4, 2023
This change makes it so newly instanced UserType during deserialization
isn't immediately copied by deepcopy, which could cause huge slowdown if
that UserType contains a lot of data or nested UserTypes, in which case the
deepcopy calls would cascade as each to_python call would eventually clone
parts of source object. As there isn't a lot of information on why this
deepcopy is here in the first place this change could potentially break
something. Running integration tests against this commit does not produce
regressions, so this call looks safe to remove, but I'm leaving this
warning here for the future reference.

Issue: scylladb#152
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants