-
Notifications
You must be signed in to change notification settings - Fork 10
Deduplicate longest common subsequence logic #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
NaN/missing key values were already working, by virtue of using set operations on them, but test those anyway.
Intermediate commit which should not change execution behavior at all
src/arrays.jl
Outdated
end | ||
end | ||
end | ||
|
||
removed = Int[] | ||
added = Int[] | ||
backtrack(lengths, removed, added, X, Y, length(X), length(Y)) | ||
|
||
backtrack(lengths, backtracks, removed, added, X, Y, length(X), length(Y)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this could be cleaned up if we store most of this information in a new struct
backtracks[1,2:end] .= :X | ||
backtracks[2:end,1] .= :Y | ||
backtracks[1,1] = :nothing | ||
backtracks = fill((0, 0), axes(lengths)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't help readability. A enum or a custom struct will probably be better at getting storage efficiency and readibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice of representation here is that it corresponds with index offsets here: https://github.com/ssfrr/DeepDiffs.jl/pull/15/files#diff-b13f4801e308864f0bd76da7a4928cc9R55
I believe any other choice would involve an eventual conversation to (0/1, 0/1)
, because that's really what they represent in the algorithm: walk backwards in X, Y, or both, or neither. Which isn't to say other choices wouldn't be better. Did you have any enum ideas?
I see the storage efficiency concern as separate (and still important). But I think that is best addressed by storing Tuple{Bool, Bool}
instead of Tuple{Int, Int}
.
sizeof(Tuple{Bool, Bool}) == 2
. You could pack the bits more efficiently, but then I would expect time efficiency to suffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably want to add a comment though that explains what the two fields represent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, seems like this would pretty much work as-is as fill((false,false), axes(lengths))
.
end | ||
|
||
(i, j) = ij |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to do this earlier and have:
i, j = ij
bt = backtracks[i, j]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better I think to just keep i
and j
as separate arguments to backtrack
, then just have bti, btj = backtracks[i,j]
.
Co-authored-by: Curtis Vogt <[email protected]>
Seems like an improvement. Is there a performance difference? It definitely needs to allocate more memory to keep the back-tracking info, but it's not clear whether there's a runtime impact (could be slower because of memory access, but also could be faster because of no branching). Also it's possible the backtracking isn't really a bottleneck because it's linear-time. Would you mind running a before-after benchmark with some largish arrays? |
else | ||
lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1]) | ||
lengths[i+1, j+1], backtracks[i+1, j+1] = _argmax(lengths[i+1, j], lengths[i, j+1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think _argmax
is a clear name. I'd just put that implementation here directly - this is the only place it's used, right?
This is pretty much purely an aesthetic suggestion to deduplicate the logic (prompted because I had to change
==
->isequal
in two places in #14 . This PR depends on that PR.)I feel like the changes lie on a spectrum. For that reason, I broke the change out into a million commits. If the code at the last commit doesn't look so great, perhaps one just before is better. Also, each commit mostly represents a straightforward transformation.