Skip to content

Deduplicate longest common subsequence logic #15

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion src/DeepDiffs.jl
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ export SimpleDiff, VectorDiff, StringDiff, DictDiff
# Helper function for comparing two instances of a type for equality by field
function fieldequal(x::T, y::T) where T
for f in fieldnames(T)
getfield(x, f) == getfield(y, f) || return false
isequal(getfield(x, f), getfield(y, f)) || return false
end
true
end
Expand Down
34 changes: 22 additions & 12 deletions src/arrays.jl
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ changed(diff::VectorDiff) = Int[]

Base.:(==)(d1::VectorDiff, d2::VectorDiff) = fieldequal(d1, d2)

_argmax(x, y) = x ≥ y ? (x, (0, 1)) : (y, (1, 0))

# diffing an array is an application of the Longest Common Subsequence problem:
# https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
function deepdiff(X::Vector, Y::Vector)
Expand All @@ -21,35 +23,43 @@ function deepdiff(X::Vector, Y::Vector)
# substrings.

lengths = zeros(Int, length(X)+1, length(Y)+1)
backtracks = fill((0, 0), axes(lengths))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't help readability. A enum or a custom struct will probably be better at getting storage efficiency and readibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice of representation here is that it corresponds with index offsets here: https://github.com/ssfrr/DeepDiffs.jl/pull/15/files#diff-b13f4801e308864f0bd76da7a4928cc9R55

I believe any other choice would involve an eventual conversation to (0/1, 0/1), because that's really what they represent in the algorithm: walk backwards in X, Y, or both, or neither. Which isn't to say other choices wouldn't be better. Did you have any enum ideas?

I see the storage efficiency concern as separate (and still important). But I think that is best addressed by storing Tuple{Bool, Bool} instead of Tuple{Int, Int}.

sizeof(Tuple{Bool, Bool}) == 2. You could pack the bits more efficiently, but then I would expect time efficiency to suffer.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want to add a comment though that explains what the two fields represent

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems like this would pretty much work as-is as fill((false,false), axes(lengths)).

backtracks[1,2:end] .= Ref((0, 1))
backtracks[2:end,1] .= Ref((1, 0))
backtracks[1,1] = (0, 0)

for (j, v2) in enumerate(Y)
for (i, v1) in enumerate(X)
if v1 == v2
if isequal(v1, v2)
lengths[i+1, j+1] = lengths[i, j] + 1
backtracks[i+1, j+1] = (1, 1)
else
lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])
lengths[i+1, j+1], backtracks[i+1, j+1] = _argmax(lengths[i+1, j], lengths[i, j+1])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think _argmax is a clear name. I'd just put that implementation here directly - this is the only place it's used, right?

end
end
end

removed = Int[]
added = Int[]
backtrack(lengths, removed, added, X, Y, length(X), length(Y))

backtrack(backtracks, removed, added, (length(X)+1, length(Y)+1))

VectorDiff(X, Y, removed, added)
end

# recursively trace back the longest common subsequence, adding items
# to the added and removed lists as we go
function backtrack(lengths, removed, added, X, Y, i, j)
if i > 0 && j > 0 && X[i] == Y[j]
backtrack(lengths, removed, added, X, Y, i-1, j-1)
elseif j > 0 && (i == 0 || lengths[i+1, j] ≥ lengths[i, j+1])
backtrack(lengths, removed, added, X, Y, i, j-1)
push!(added, j)
elseif i > 0 && (j == 0 || lengths[i+1, j] < lengths[i, j+1])
backtrack(lengths, removed, added, X, Y, i-1, j)
push!(removed, i)
function backtrack(backtracks, removed, added, ij)
bt = backtracks[ij...]
if bt != (0, 0)
backtrack(backtracks, removed, added, ij .- bt)
end

(i, j) = ij
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to do this earlier and have:

i, j = ij
bt = backtracks[i, j]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better I think to just keep i and j as separate arguments to backtrack, then just have bti, btj = backtracks[i,j].

if bt == (0, 1)
push!(added, j-1)
elseif bt == (1, 0)
push!(removed, i-1)
end
end

Expand Down
2 changes: 1 addition & 1 deletion src/dicts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ function deepdiff(X::AbstractDict, Y::AbstractDict)
changed = Dict{eltype(bothkeys), DeepDiff}()

for key in bothkeys
if X[key] != Y[key]
if !isequal(X[key], Y[key])
changed[key] = deepdiff(X[key], Y[key])
else
push!(unchanged, key)
Expand Down
9 changes: 9 additions & 0 deletions test/arrays.jl
Original file line number Diff line number Diff line change
Expand Up @@ -44,4 +44,13 @@
d = deepdiff(a1, [2, 4])
@test removed(d) == [1, 3]
@test added(d) == []

d = deepdiff([NaN], [NaN])
@test removed(d) == added(d) == []

d = deepdiff([missing], [missing])
@test removed(d) == added(d) == []

d = deepdiff([NaN], [])
@test d == d
end
26 changes: 26 additions & 0 deletions test/dicts.jl
Original file line number Diff line number Diff line change
Expand Up @@ -86,4 +86,30 @@
@test removed(d) == Set()
@test changed(d) == Dict()
end

@testset "missing/NaN" begin
dnan = Dict(:d=>NaN)
d = deepdiff(dnan, dnan)
@test added(d) == removed(d) == Set()
@test changed(d) == Dict()

dmis = Dict(:d=>missing)
d = deepdiff(dmis, dmis)
@test added(d) == removed(d) == Set()
@test changed(d) == Dict()

dnank = Dict(NaN=>true)
d = deepdiff(dnan, dnan)
@test added(d) == removed(d) == Set()
@test changed(d) == Dict()

dmisk = Dict(missing=>true)
d = deepdiff(dmis, dmis)
@test added(d) == removed(d) == Set()
@test changed(d) == Dict()

d = DeepDiffs.deepdiff(dnank, dmisk)
@test d == d

end
end