Skip to content

Deduplicate longest common subsequence logic #15

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

goretkin
Copy link
Contributor

This is pretty much purely an aesthetic suggestion to deduplicate the logic (prompted because I had to change == -> isequal in two places in #14 . This PR depends on that PR.)

I feel like the changes lie on a spectrum. For that reason, I broke the change out into a million commits. If the code at the last commit doesn't look so great, perhaps one just before is better. Also, each commit mostly represents a straightforward transformation.

src/arrays.jl Outdated
end
end
end

removed = Int[]
added = Int[]
backtrack(lengths, removed, added, X, Y, length(X), length(Y))

backtrack(lengths, backtracks, removed, added, X, Y, length(X), length(Y))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could be cleaned up if we store most of this information in a new struct

backtracks[1,2:end] .= :X
backtracks[2:end,1] .= :Y
backtracks[1,1] = :nothing
backtracks = fill((0, 0), axes(lengths))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't help readability. A enum or a custom struct will probably be better at getting storage efficiency and readibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice of representation here is that it corresponds with index offsets here: https://github.com/ssfrr/DeepDiffs.jl/pull/15/files#diff-b13f4801e308864f0bd76da7a4928cc9R55

I believe any other choice would involve an eventual conversation to (0/1, 0/1), because that's really what they represent in the algorithm: walk backwards in X, Y, or both, or neither. Which isn't to say other choices wouldn't be better. Did you have any enum ideas?

I see the storage efficiency concern as separate (and still important). But I think that is best addressed by storing Tuple{Bool, Bool} instead of Tuple{Int, Int}.

sizeof(Tuple{Bool, Bool}) == 2. You could pack the bits more efficiently, but then I would expect time efficiency to suffer.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want to add a comment though that explains what the two fields represent

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, seems like this would pretty much work as-is as fill((false,false), axes(lengths)).

end

(i, j) = ij
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to do this earlier and have:

i, j = ij
bt = backtracks[i, j]

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better I think to just keep i and j as separate arguments to backtrack, then just have bti, btj = backtracks[i,j].

@ssfrr
Copy link
Owner

ssfrr commented May 15, 2020

Seems like an improvement. Is there a performance difference? It definitely needs to allocate more memory to keep the back-tracking info, but it's not clear whether there's a runtime impact (could be slower because of memory access, but also could be faster because of no branching). Also it's possible the backtracking isn't really a bottleneck because it's linear-time.

Would you mind running a before-after benchmark with some largish arrays?

else
lengths[i+1, j+1] = max(lengths[i+1, j], lengths[i, j+1])
lengths[i+1, j+1], backtracks[i+1, j+1] = _argmax(lengths[i+1, j], lengths[i, j+1])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think _argmax is a clear name. I'd just put that implementation here directly - this is the only place it's used, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants