Deduplicate longest common subsequence logic #15

goretkin · 2020-05-10T04:43:00Z

This is pretty much a purely aesthetic suggestion to deduplicate the logic (prompted by having to change == to isequal in two places in #14). This PR depends on that PR.

I feel like the changes lie on a spectrum. For that reason, I broke the change out into a million commits. If the code at the last commit doesn't look so great, perhaps one just before is better. Also, each commit mostly represents a straightforward transformation.

NaN/missing key values were already working, by virtue of using set operations on them, but test those anyway.

Intermediate commit which should not change execution behavior at all

omus · 2020-05-12T13:49:56Z

src/arrays.jl

            end
        end
    end

    removed = Int[]
    added = Int[]
-    backtrack(lengths, removed, added, X, Y, length(X), length(Y))
+
+    backtrack(lengths, backtracks, removed, added, X, Y, length(X), length(Y))


I feel like this could be cleaned up if we store most of this information in a new struct

omus · 2020-05-12T13:55:18Z

src/arrays.jl

-    backtracks[1,2:end] .= :X
-    backtracks[2:end,1] .= :Y
-    backtracks[1,1] = :nothing
+    backtracks = fill((0, 0), axes(lengths))


Doesn't help readability. An enum or a custom struct would probably be better for both storage efficiency and readability.

This representation was chosen because it corresponds directly to the index offsets used here: https://github.com/ssfrr/DeepDiffs.jl/pull/15/files#diff-b13f4801e308864f0bd76da7a4928cc9R55

I believe any other choice would involve an eventual conversion to (0/1, 0/1), because that's really what they represent in the algorithm: walk backwards in X, Y, or both, or neither. Which isn't to say other choices wouldn't be better. Did you have any enum ideas?

I see the storage efficiency concern as separate (and still important). But I think that is best addressed by storing Tuple{Bool, Bool} instead of Tuple{Int, Int}.

sizeof(Tuple{Bool, Bool}) == 2. You could pack the bits more efficiently, but then I would expect time efficiency to suffer.

We probably want to add a comment, though, that explains what the two fields represent.

Yeah, seems like this would pretty much work as-is as fill((false,false), axes(lengths)).
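To make the Tuple{Bool, Bool} idea concrete, here is a minimal sketch. It is not the PR's code: the table size, the convention that the first field means "step back in X" and the second "step back in Y", and the boundary initialisation are all assumptions made for illustration.

```julia
# Hypothetical sketch of the Bool-pair representation discussed above.
# Assumed convention for each entry (step back in X, step back in Y):
#   (false, false) = stop           (true, true)  = step back in both
#   (true,  false) = step back in X (false, true) = step back in Y
lengths = zeros(Int, 4, 5)                       # stand-in for the LCS length table
backtracks = fill((false, false), axes(lengths))

# Ref(...) makes broadcasting treat the tuple as a single value.
backtracks[2:end, 1] .= Ref((true, false))       # first column: only X can be walked
backtracks[1, 2:end] .= Ref((false, true))       # first row: only Y can be walked

# Bools act as 0/1 offsets when subtracted from an index.
i, j = 3, 1
bti, btj = backtracks[i, j]
i, j = i - bti, j - btj

# Storage cost per entry, in bytes.
bool_pair_bytes = sizeof(Tuple{Bool, Bool})
int_pair_bytes = sizeof(Tuple{Int, Int})
```

Because Bool is a subtype of Integer in Julia, the pair can be used as index offsets without an explicit conversion step, which is the property the thread is relying on.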

omus · 2020-05-12T13:59:24Z

src/arrays.jl

    end

+    (i, j) = ij


Probably better to do this earlier and have:

    i, j = ij
    bt = backtracks[i, j]

Better, I think, to just keep i and j as separate arguments to backtrack, and then have bti, btj = backtracks[i,j].
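A self-contained sketch of how a backtrack that takes i and j separately might look. This is an illustration only: lcs_tables and backtrack! are hypothetical names, the tie-breaking rule and the [i+1, j+1] table offset are assumptions, and the function in the PR may be organised differently.

```julia
# Hypothetical sketch, not the PR's code: fill an LCS length table together
# with a table of (step back in X, step back in Y) flags, then walk it backwards.
function lcs_tables(X, Y)
    lengths = zeros(Int, length(X) + 1, length(Y) + 1)
    backtracks = fill((false, false), axes(lengths))
    backtracks[2:end, 1] .= Ref((true, false))   # Y exhausted: step back in X
    backtracks[1, 2:end] .= Ref((false, true))   # X exhausted: step back in Y
    for (i, x) in enumerate(X), (j, y) in enumerate(Y)
        if isequal(x, y)
            lengths[i+1, j+1] = lengths[i, j] + 1
            backtracks[i+1, j+1] = (true, true)
        elseif lengths[i+1, j] >= lengths[i, j+1]    # assumed tie-break: prefer Y
            lengths[i+1, j+1] = lengths[i+1, j]
            backtracks[i+1, j+1] = (false, true)
        else
            lengths[i+1, j+1] = lengths[i, j+1]
            backtracks[i+1, j+1] = (true, false)
        end
    end
    return lengths, backtracks
end

# i and j count how many elements of X and Y are still under consideration,
# so the table entry for (i, j) lives at [i+1, j+1].
function backtrack!(removed, added, backtracks, i, j)
    bti, btj = backtracks[i+1, j+1]
    if !bti && !btj
        return nothing                           # (false, false): origin reached
    end
    backtrack!(removed, added, backtracks, i - bti, j - btj)
    bti && !btj && push!(removed, i)             # X[i] is not part of the LCS
    btj && !bti && push!(added, j)               # Y[j] is not part of the LCS
    return nothing
end

X, Y = [1, 2, 3, 4], [2, 3, 5]
lengths, backtracks = lcs_tables(X, Y)
removed, added = Int[], Int[]
backtrack!(removed, added, backtracks, length(X), length(Y))
```

With the flags stored in the table, the walk has a single recursive call and no comparisons against lengths, which is the deduplication this PR is after.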

ssfrr · 2020-05-15T01:23:09Z

Seems like an improvement. Is there a performance difference? It definitely needs to allocate more memory to keep the back-tracking info, but it's not clear whether there's a runtime impact (could be slower because of memory access, but also could be faster because of no branching). Also it's possible the backtracking isn't really a bottleneck because it's linear-time.

Would you mind running a before-after benchmark with some largish arrays?
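One possible shape for such a before/after comparison, using only Base and the Random standard library. The helper name time_diff, the input sizes, and the value range are assumptions; the function to measure is passed in so that the same harness can be run on each branch.

```julia
using Random

# Hypothetical timing harness: `difffn` is whatever two-argument diff function
# is being measured. Returns the fastest of `trials` runs, in seconds.
function time_diff(difffn, n; trials = 5)
    rng = MersenneTwister(0)          # fixed seed so both branches see the same input
    X = rand(rng, 1:100, n)
    Y = rand(rng, 1:100, n)
    difffn(X, Y)                      # warm-up call so compilation is not timed
    return minimum(@elapsed(difffn(X, Y)) for _ in 1:trials)
end

# Stand-in workload so the sketch runs on its own; replace with the real diff.
placeholder_diff(X, Y) = count(isequal.(X, Y))
t = time_diff(placeholder_diff, 2_000)
```

On each branch one would pass the package's own diff function in place of placeholder_diff, for example time_diff(deepdiff, 2_000), assuming deepdiff on two vectors exercises this code path. Comparing allocations as well (for instance with @allocated) would show the extra memory for the backtracking table.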

ssfrr · 2020-05-15T01:44:27Z

src/arrays.jl

            else
-                lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])
+                lengths[i+1, j+1], backtracks[i+1, j+1] = _argmax(lengths[i+1, j], lengths[i, j+1])


I don't think _argmax is a clear name. I'd just put that implementation here directly - this is the only place it's used, right?
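For illustration, inlining could look roughly like the following. The helper body is not shown in this thread, so the version below is a guess: the argument order, the tie-breaking rule, and the returned flags are all assumptions.

```julia
# Hypothetical stand-in for the helper under discussion: returns the larger
# value together with a flag pair saying which neighbour it came from.
_argmax(left, up) = left >= up ? (left, (false, true)) : (up, (true, false))

# The same choice written inline at the call site, as suggested:
left, up = 2, 3
len, bt = left >= up ? (left, (false, true)) : (up, (true, false))
```

Written inline, the tie-breaking rule sits next to the table update, so a reader does not need to look up what the helper returns on equal lengths.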

goretkin added 15 commits May 9, 2020 22:55

Add tests where == fails

aeb7ba6

Fix Vector equality check

bd845f7

Fix equality of diffs

95746e0

Fix comparison of dicts

19cc640

NaN/missing key values were already working, by virtue of using set operations on them, but test those anyway.

Mirror backtrack logic

e137c1a

Intermediate commit which should not change execution behavior at all

Restore some beauty

1ce7ea6

Baby step

1e27f5e

Baby steps

2df509f

Boundary logic to boundary data

d9b0a2b

Replace :X,:Y,:XY,:nothing with their coordinate representations

70d7207

Use coordinates to update i,j

7d2d100

Factor out recursive call

388b023

Remove unused args

7288ab6

Really bikeshedding

3189a1a

Bikeshedded

fbc7ff8

Apply suggestions from code review

507e53d

Co-authored-by: Curtis Vogt
