Skip to content

Fixes Edge Cases with #167 #169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 16, 2018
Merged

Fixes Edge Cases with #167 #169

merged 5 commits into from
Nov 16, 2018

Conversation

JosephTomlinson
Copy link
Contributor

@JosephTomlinson JosephTomlinson commented Sep 18, 2018

The previous solution (#167 ) was incorrect when the range resulted in a whole number output.
For example a length 50 array being split over 4 was resulting in 13,12,13,12.

The new solution is less elegant, but doesn't rely on using range so has no issues with rounding being inconsistent.

I am unsure how to actually write a test for this as all the ways I can think to do so depend on me indexing OTHERIDS which seems to throw an exception.

@vchuravy
Copy link
Member

Maybe a direct test on defaultdist would be sufficient to show that it yields the kinds of cuts you want.

@JosephTomlinson
Copy link
Contributor Author

Okay I've added in an edge case as a test and it seems to work. Any idea why the OSX builds all fail during the subdarray conversion test? Not sure how what I've done could affect that or why it's OSX only.

@vchuravy
Copy link
Member

Odd #170 also shows mac-specific behaviour.

@vchuravy vchuravy closed this Sep 24, 2018
@vchuravy vchuravy reopened this Sep 24, 2018
@vchuravy vchuravy added this to the v0.6 milestone Oct 1, 2018
@JosephTomlinson
Copy link
Contributor Author

So I've had some time to investigate this and it appears to not be OS specific but actually number of processes specific, at least as far as I can tell.
The specific error seems to be that when called with 3 workers the length 5x4 view when copied is split into 2x4, 2x4, 1x4 and the 1x4 seems to be causing issues.

Specifically in function Base.:(==)(d::Darray{<:Any, <:Any, A} , a::AbstractArray) where A
Inside the async map a is indexed with a[localindices(d)...] which seems to be causing an issue with the 1x4 index block. If you print the result of the above indexing statement on its own you get correct results for worker 2&3 but worker 4 prints [#undef, #undef, #undef, #undef].
Note that if you add a print statement to test d[localindices(d)...] you get the same undef results.

When I manually index, i.e. test s[end:end, :] directly, the original view is fine, but if I copy it then manually index, copy(s)[end:end, :], I get the same undef results and indexing a converted view, Darray(s)[end, :], again results in undef output.

This suggest to me that there is actually a bug in copy and/or the relevant Darray constructor but I'm unsure how to proceed or if I've missed something.

An additional concern is why the previous distribution did not produce this error since it seems independent of that, unless maybe it only errors when the length 1 index is end?

@vchuravy
Copy link
Member

vchuravy commented Oct 2, 2018

Ok thanks for debugging that, if you have a test-case that is independent of the number of processors that would help greatly. I will look into this, but it will take me some time.

@JosephTomlinson
Copy link
Contributor Author

JosephTomlinson commented Oct 2, 2018

For a number of processors independent way to reproduce the bug you should be able to use the following (note that while reproducing the bug I discovered that this only occurs for an odd number of workers, I can't rationalize why that matters though)

using Distributed
const np = 3
addprocs(np)
@everywhere using DistributedArrays

a = drand(20, 20);

s = view(a, 1:(2*np)-1, 5:8)
sc = DArray(s)

println(s[end,1:4])
println(sc[end,1:4])
println(copy(s)[end,1:4])
s == sc #This is where the error actually happens, presumably through converting the undef

@JosephTomlinson
Copy link
Contributor Author

So I finally got the tests to pass, there was an issue that only occured for small arrays on a small number of processes but it has been fixed now. I believe this should be good to merge.

@vchuravy vchuravy merged commit 17b77d7 into JuliaParallel:master Nov 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants