Deduplicate longest common subsequence logic #15
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,6 +13,8 @@ changed(diff::VectorDiff) = Int[] | |
|
||
Base.:(==)(d1::VectorDiff, d2::VectorDiff) = fieldequal(d1, d2) | ||
|
||
_argmax(x, y) = x ≥ y ? (x, (0, 1)) : (y, (1, 0)) | ||
|
||
# diffing an array is an application of the Longest Common Subsequence problem: | ||
# https://en.wikipedia.org/wiki/Longest_common_subsequence_problem | ||
function deepdiff(X::Vector, Y::Vector) | ||
|
@@ -21,35 +23,43 @@ function deepdiff(X::Vector, Y::Vector) | |
# substrings. | ||
|
||
lengths = zeros(Int, length(X)+1, length(Y)+1) | ||
backtracks = fill((0, 0), axes(lengths)) | ||
backtracks[1,2:end] .= Ref((0, 1)) | ||
backtracks[2:end,1] .= Ref((1, 0)) | ||
backtracks[1,1] = (0, 0) | ||
|
||
for (j, v2) in enumerate(Y) | ||
for (i, v1) in enumerate(X) | ||
if v1 == v2 | ||
if isequal(v1, v2) | ||
lengths[i+1, j+1] = lengths[i, j] + 1 | ||
backtracks[i+1, j+1] = (1, 1) | ||
else | ||
lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1]) | ||
lengths[i+1, j+1], backtracks[i+1, j+1] = _argmax(lengths[i+1, j], lengths[i, j+1]) | ||
I don't think |
||
end | ||
end | ||
end | ||
|
||
removed = Int[] | ||
added = Int[] | ||
backtrack(lengths, removed, added, X, Y, length(X), length(Y)) | ||
|
||
backtrack(backtracks, removed, added, (length(X)+1, length(Y)+1)) | ||
|
||
VectorDiff(X, Y, removed, added) | ||
end | ||
|
||
# recursively trace back the longest common subsequence, adding items | ||
# to the added and removed lists as we go | ||
function backtrack(lengths, removed, added, X, Y, i, j) | ||
if i > 0 && j > 0 && X[i] == Y[j] | ||
backtrack(lengths, removed, added, X, Y, i-1, j-1) | ||
elseif j > 0 && (i == 0 || lengths[i+1, j] ≥ lengths[i, j+1]) | ||
backtrack(lengths, removed, added, X, Y, i, j-1) | ||
push!(added, j) | ||
elseif i > 0 && (j == 0 || lengths[i+1, j] < lengths[i, j+1]) | ||
backtrack(lengths, removed, added, X, Y, i-1, j) | ||
push!(removed, i) | ||
function backtrack(backtracks, removed, added, ij) | ||
bt = backtracks[ij...] | ||
if bt != (0, 0) | ||
backtrack(backtracks, removed, added, ij .- bt) | ||
end | ||
|
||
(i, j) = ij | ||
Probably better to do this earlier and have:
    i, j = ij
    bt = backtracks[i, j]
Better, I think, to just keep |
||
if bt == (0, 1) | ||
push!(added, j-1) | ||
elseif bt == (1, 0) | ||
push!(removed, i-1) | ||
end | ||
end | ||
|
||
|
This doesn't help readability. An enum or a custom struct would probably be better for both storage efficiency and readability.
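For reference, an enum-based variant might look roughly like the sketch below. The type name `Step`, its member names, and the `offset` helper are all hypothetical and do not appear in the PR. The sketch only illustrates one possible shape, including the mapping back to index offsets that the backtracking walk would still need.

```julia
# Hypothetical enum-based representation of a backtrack step.
# None of these names come from DeepDiffs.jl.
@enum Step::UInt8 StopStep AddedStep RemovedStep MatchStep

# Map each step to the (Δi, Δj) offset used when walking backwards
# through the table: step back in X, in Y, in both, or in neither.
function offset(s::Step)
    s == MatchStep   && return (1, 1)
    s == RemovedStep && return (1, 0)
    s == AddedStep   && return (0, 1)
    return (0, 0)
end

# Each entry occupies one byte because the enum is backed by UInt8.
step_size = sizeof(Step)

# Walking back one row from index (3, 4):
prev_index = (3, 4) .- offset(RemovedStep)
```

The enum names the four cases explicitly, at the cost of an extra lookup wherever the offsets are needed.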
This representation was chosen because it corresponds to the index offsets used here: https://github.com/ssfrr/DeepDiffs.jl/pull/15/files#diff-b13f4801e308864f0bd76da7a4928cc9R55
I believe any other choice would involve an eventual conversion to `(0/1, 0/1)`, because that's really what they represent in the algorithm: walk backwards in X, Y, both, or neither. Which isn't to say other choices wouldn't be better. Did you have any enum ideas?
I see the storage efficiency concern as separate (and still important). But I think that is best addressed by storing `Tuple{Bool, Bool}` instead of `Tuple{Int, Int}`: `sizeof(Tuple{Bool, Bool}) == 2`. You could pack the bits more efficiently, but then I would expect time efficiency to suffer.
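A quick check of the sizes and the index arithmetic mentioned above, as a sketch rather than a benchmark. The variable names are illustrative only. The `Int` tuple size depends on the platform word size, so it is expressed in terms of `sizeof(Int)`.

```julia
# Element sizes of the two tuple representations.
int_size  = sizeof(Tuple{Int, Int})    # 2 * sizeof(Int); 16 bytes when Int is Int64
bool_size = sizeof(Tuple{Bool, Bool})  # 2 bytes

# Bool offsets still work for index arithmetic, because subtracting
# a Bool from an Int promotes to Int.
ij = (3, 4)
prev = ij .- (true, false)

# Comparisons against integer tuples also still hold, since false == 0.
same = (false, true) == (0, 1)
```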
Probably want to add a comment, though, that explains what the two fields represent.
Yeah, seems like this would pretty much work as-is as `fill((false,false), axes(lengths))`.
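To make that suggestion concrete, here is a self-contained sketch of the table fill and backtracking from the diff above, using `Tuple{Bool, Bool}` entries. The function names `lcs_diff` and `walk!` are hypothetical, and the sketch returns plain index vectors instead of a `VectorDiff`, so it is an approximation of the PR's approach and not the merged code.

```julia
# Each backtrack entry is (step back in X, step back in Y).
# (false, false) marks the origin, where the walk stops.
_argmax(x, y) = x ≥ y ? (x, (false, true)) : (y, (true, false))

function lcs_diff(X::Vector, Y::Vector)
    lengths = zeros(Int, length(X) + 1, length(Y) + 1)
    backtracks = fill((false, false), axes(lengths))
    backtracks[1, 2:end] .= Ref((false, true))   # first row: only Y items remain
    backtracks[2:end, 1] .= Ref((true, false))   # first column: only X items remain

    for (j, v2) in enumerate(Y), (i, v1) in enumerate(X)
        if isequal(v1, v2)
            lengths[i+1, j+1] = lengths[i, j] + 1
            backtracks[i+1, j+1] = (true, true)
        else
            lengths[i+1, j+1], backtracks[i+1, j+1] =
                _argmax(lengths[i+1, j], lengths[i, j+1])
        end
    end

    removed, added = Int[], Int[]
    walk!(backtracks, removed, added, (length(X) + 1, length(Y) + 1))
    return removed, added
end

# Recurse to the origin first so indices are pushed in ascending order.
function walk!(backtracks, removed, added, ij)
    i, j = ij
    bt = backtracks[i, j]
    bt != (false, false) && walk!(backtracks, removed, added, ij .- bt)
    if bt == (false, true)
        push!(added, j - 1)      # item j-1 of Y is not in the common subsequence
    elseif bt == (true, false)
        push!(removed, i - 1)    # item i-1 of X is not in the common subsequence
    end
    return nothing
end
```

For example, `lcs_diff([1, 2, 3], [1, 3, 4])` reports index 2 of `X` as removed and index 3 of `Y` as added.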