Speed bottleneck on deepcopy #152
Comments
FYI @jsurloppe @dargor @wyfo, since we're into Python Scylla perf analysis: maybe we're affected by this somehow too, and investigating could help? @MicaelJarniac please be kindly reminded that we maintain a fork of the official cassandra driver here, so we don't have all the background and history on every design and legacy decision. The plan at ScyllaDB is to switch away from this fork someday and use the scylla rust driver as a sane base for other language drivers with bindings. |
I think this can easily be replaced by a dict comprehension:

    copied_value = {
        name: field.to_python(value[name])
        if value is not None or isinstance(field, BaseContainerColumn)
        else value[name]
        for name, field in self.user_type._fields.items()
    }
    return copied_value
|
@MicaelJarniac did you try what @fruch is proposing by any chance? |
I'm just wondering why the check if copied_value[name] is not None or isinstance(field, BaseContainerColumn) is there in the first place. I mean, I also think there is a mistake in @fruch's code. I don't know this driver, so I don't know if there are some hidden implications, but here is a quick timeit comparison of deepcopy, a dict comprehension, and a shallow copy:
$ python -m timeit -s "deepcopy = __import__(\"copy\").deepcopy; N = 10; value = dict(zip(range(N), range(N)))" "copied = deepcopy(value)" "for i in range(N):" "    if (val := value[i]) >= 5:" "        copied[i] = val + 1"
20000 loops, best of 5: 12.8 usec per loop
$ python -m timeit -s "N = 10; value = dict(zip(range(N), range(N)))" "{i: value[i] if value[i] < 5 else value[i] + 1 for i in range(N)}"
200000 loops, best of 5: 1.8 usec per loop
$ python -m timeit -s "N = 10; value = dict(zip(range(N), range(N)))" "copied = value.copy()" "for i in range(N):" " if value[i] >= 5:" " copied[i] = copied[i] + 1"
200000 loops, best of 5: 1.28 usec per loop |
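If the condition really was meant to test each field's value rather than the whole dict (which is how the original check quoted above reads), the comprehension would presumably look more like the sketch below. This is only a guess at the intent; BaseContainerColumn, self.user_type._fields, and field.to_python are taken from the snippets quoted in this thread, everything else is an assumption:

```python
# Hedged sketch: the comprehension with the None-check applied per field
# value (mirroring the original "copied_value[name] is not None" test)
# instead of to the whole input dict.
def to_python(self, value):
    return {
        name: field.to_python(value[name])
        if value[name] is not None or isinstance(field, BaseContainerColumn)
        else value[name]
        for name, field in self.user_type._fields.items()
    }
```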
@MicaelJarniac @wyfo, since this part of the code isn't scylla specific, it would probably make sense to raise it with the upstream cassandra driver as well. |
I was originally going to open this issue there, but they don't use GitHub Issues; they want us to create an account on another tracker or whatever, so I didn't bother to go through their bureaucracy. But I agree, it'd make more sense for this to go there instead.
I haven't tested it yet, sadly. |
@MicaelJarniac if you get a cold shoulder there, we could try applying those fixes here, but I'd first want to enable the cqlengine integration tests; we are not really running them under this fork. Also, if you guys have some benchmark tests you are using, it would be nice if you could share them (or even contribute them as tests). |
I've also encountered this issue with a Model that contains a UDT that contains a list of nested UDTs; in that case the deepcopy calls cascade and deserialization becomes very slow. |
After running the cqlengine tests I have not found any regressions when completely removing the deepcopy. I believe the rationale behind those deepcopies is to protect the source object from modifications during db operations; however, in this case the value being copied is a freshly deserialized UserType, so there is nothing external to protect. |
Do a PR with this, and we'll put it on the next release |
This change makes it so a newly instantiated UserType during deserialization isn't immediately copied by deepcopy, which could cause a huge slowdown if that UserType contains a lot of data or nested UserTypes; in that case the deepcopy calls would cascade, as each to_python call would eventually clone parts of the source object. As there isn't a lot of information on why this deepcopy is here in the first place, this change could potentially break something. Running integration tests against this commit does not produce regressions, so this call looks safe to remove, but I'm leaving this warning here for future reference. Fixes scylladb#152
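Going by that commit message and the method shape discussed earlier in the thread, the patched deserialization path presumably just transforms the freshly built value in place, along the lines of this sketch (not the actual diff; names come from the snippets quoted above):

```python
# Rough sketch of the described change: iterate over the freshly
# deserialized value in place instead of deep-copying it first.
def to_python(self, value):
    # value is a UserType freshly instantiated during deserialization, so
    # mutating it in place doesn't touch anything the caller owns (the
    # rationale given in the commit message above).
    for name, field in self.user_type._fields.items():
        if value[name] is not None or isinstance(field, BaseContainerColumn):
            value[name] = field.to_python(value[name])
    return value
```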
We're having issues with extremely slow queries while using the object mapper. The raw query itself isn't slow, but instantiating the objects in Python seems to be.
While profiling, we noticed that there seems to be a bottleneck on the following line, using deepcopy:
python-driver/cassandra/cqlengine/columns.py, line 1041 (at commit 94b64bb)
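For context, assembling the pieces quoted elsewhere in this thread (the deepcopy at that line, the condition @wyfo quotes, and the loop from @fruch's suggestion), the referenced method presumably looks roughly like this. Treat it as a reconstruction, not a verbatim quote of the driver's code:

```python
# Approximate shape of UserDefinedType.to_python as discussed in this issue.
def to_python(self, value):
    copied_value = deepcopy(value)  # the line that shows up as the bottleneck
    for name, field in self.user_type._fields.items():
        if copied_value[name] is not None or isinstance(field, BaseContainerColumn):
            copied_value[name] = field.to_python(copied_value[name])
    return copied_value
```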
We're not entirely sure what that deepcopy is for, so we tried removing it locally, and it seemed to result in a huge speed boost.
Simply removing it might break something, so I won't suggest doing that straight away, but I believe there is a better approach that avoids the deepcopy when it isn't necessary.
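For anyone who wants to reproduce this kind of analysis without a cluster, here is a small, self-contained profiling sketch. The data is purely synthetic (a dict holding a list of nested dicts, a stand-in for a UDT with nested UDTs), and the function names are hypothetical:

```python
import copy
import cProfile
import pstats

# Synthetic stand-in for a model value: a UDT-like dict holding a list of
# nested UDT-like dicts (made-up data, just for demonstration).
value = {"items": [{"a": i, "b": str(i), "c": list(range(10))} for i in range(100)]}

def deserialize_many(n=1000):
    # Mimics a per-row conversion that deep-copies its input each time.
    for _ in range(n):
        copy.deepcopy(value)

cProfile.run("deserialize_many()", "deepcopy.prof")
# deepcopy and its helpers dominate the cumulative-time listing, which is
# the same pattern described above when profiling real queries.
pstats.Stats("deepcopy.prof").sort_stats("cumulative").print_stats(10)
```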