
Parallel indexed iterator #279


Closed · msiglreith opened this issue Mar 21, 2017 · 14 comments

@msiglreith (Contributor)

IndexedIter and IndexedIterMut should have parallel iterator support.

Often it's desirable to have the element coordinates in addition to the element itself, for example to access fields in other arrays. Example use case (part of a fluid simulation code):

div.indexed_iter_mut().for_each(|((y, x), div)| {
    *div = -(vel.x[(y, x+1)] - vel.x[(y, x)] + vel.y[(y+1, x)] - vel.y[(y, x)]);
});

I quickly tried to implement it myself, similar to the AxisIter implementation, but I'm unsure how to properly split the iterator.

@bluss (Member) commented Mar 21, 2017

Right. The indexed iterator is already slow as it is, I fear.

@mokasin commented Mar 23, 2017

Related to that, I'm wondering how to compute something like a gradient on a field efficiently in parallel. That means I need to assign to every array element a value that depends on the values at neighbouring indices. I tried this very naive (silly) approach, simplified here to 1D:

extern crate rayon;
extern crate ndarray;

use rayon::prelude::*;
use ndarray::Array;

fn main() {
    let a = Array::from_elem(10, 1);
    let mut b = Array::zeros(10);

    (1usize..8)
        .into_par_iter()
        .map(|i|
             b[i] = a[i+1] - a[i-1]
        );

    println!("array: {:?}", b);
}

which fails miserably:

error[E0387]: cannot borrow data mutably in a captured outer variable in an `Fn` closure
  --> src/main.rs:14:14
   |
14 |              b[i] = a[i+1] - a[i-1]
   |              ^
   |
help: consider changing this closure to take self by mutable reference
  --> src/main.rs:13:14
   |
13 |           .map(|i|
   |  ______________^ starting here...
14 | |              b[i] = a[i+1] - a[i-1]
   | |___________________________________^ ...ending here

I'd like to do something like this for an n-dimensional array. Setting aside the point that this might split in a very cache-unfriendly way: how can I do that?

@bluss (Member) commented Mar 23, 2017

Right, and I would shoot for vectorization (SIMD parallelization) before thread-based parallelization.

@mokasin We want to add a parallel version of array zips, which should address this on a basic level. For 1D we can of course use collect_into on a Vec and then make an array, since Vec → Array conversion has no allocation/copy cost.
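
For illustration, here is a minimal 1D sketch of that route (a sketch only, assuming rayon and ndarray as in the example above): compute the gradient into a Vec in parallel, then turn the Vec into an Array1, which reuses the allocation.

extern crate rayon;
extern crate ndarray;

use rayon::prelude::*;
use ndarray::Array1;

fn main() {
    let a = Array1::from_elem(10, 1.0_f64);

    // Each output element only reads from `a`, so the closure needs no mutable
    // captures and a plain parallel map works.
    let grad: Vec<f64> = (1..a.len() - 1)
        .into_par_iter()
        .map(|i| a[i + 1] - a[i - 1])
        .collect();

    // Vec -> Array1 conversion reuses the Vec's allocation; no copy is made.
    let b = Array1::from(grad);
    println!("gradient: {:?}", b);
}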

@bluss (Member) commented Mar 23, 2017

The link here is not about threading, but it shows an example that can inspire a solution that computes the gradients using array-wide operations. That will at least achieve some limited degree of SIMD parallelization.

https://bluss.github.io/rust-ndarray/master/ndarray/macro.s.html

@bluss (Member) commented Mar 28, 2017

The n-ary Zip work brings some new perspectives. It's not very simple to split an index iterator, but we can split a producer (think of it like an IntoIterator: it doesn't have the problem of a partially iterated state).
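
As a rough illustration of that difference (a sketch, not the actual NdProducer machinery): an array view carries no iteration state, so it can be split cleanly in two, and each half could then be handed off to a separate task.

extern crate ndarray;

use ndarray::{Array2, Axis};

fn main() {
    let a = Array2::<f64>::zeros((8, 8));
    let view = a.view();

    // A view has no partially iterated state, so splitting it along an axis
    // simply yields two smaller views.
    let mid = view.len_of(Axis(0)) / 2;
    let (top, bottom) = view.split_at(Axis(0), mid);

    println!("top: {:?}, bottom: {:?}", top.dim(), bottom.dim());
}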

@bluss (Member) commented Mar 31, 2017

Ok, Zip now exists.

@msiglreith Let's try the idea that we don't need indices! The mapping x, y → y, x is a transpose, so that's easy. And with some slicing we can get both the current and the next element along an axis. Something like the following code. Maybe one should double check that the computation is correct, and also think hard about how to handle borders.

One can play with the memory layouts here to make it perform well. div and vx, vy need to have opposite layouts since we access vx, vy transposed. I picked F order for vx, vy, so that the sliced, transposed arrays have contiguous rows, which is the case ndarray prefers to optimize for (it will be more open-minded in the future).

Zip parallelization is basically ready, but needs rayon-git: #288 (it's possible I'll merge this soon enough, but I can't publish before rayon has a new version). Using parallelization is as simple as switching from apply to par_apply, once ndarray-parallel has the changes; a sketch of that follows the example below.

extern crate rand;
#[macro_use(s, azip)]
extern crate ndarray;
extern crate ndarray_rand;

use ndarray::prelude::*;
use ndarray_rand::RandomExt;
use rand::distributions::Range;
use ndarray::Zip;

fn main() {
    type Arrayf64<D> = Array<f64, D>;
    const N: usize = 10;
    let vx = Arrayf64::random((N, N).f(), Range::new(0., 100.));
    let vy = Arrayf64::random((N, N).f(), Range::new(0., 100.));
    println!("{:8.3}", vx);
    println!("{:8.3}", vy);
    let mut div = Arrayf64::zeros((N - 1, N - 1));

    // Pair each element of div with the current and shifted elements of vx/vy;
    // .t() transposes the slices to match div's layout.
    Zip::from(&mut div)
        .and(vx.slice(s![..-1, ..-1]).t())  // vx, current position
        .and(vx.slice(s![1.., ..-1]).t())   // vx, shifted by one along axis 0
        .and(vy.slice(s![..-1, ..-1]).t())  // vy, current position
        .and(vy.slice(s![..-1, 1..]).t())   // vy, shifted by one along axis 1
        .apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        });
    println!("{:8.3}", div);
}
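
For reference, a sketch of the parallel form mentioned above: the same Zip, with apply switched to par_apply, assuming ndarray-parallel with those changes is available and its prelude is imported (the random inputs are replaced by ones just to keep it short).

#[macro_use(s)]
extern crate ndarray;
extern crate ndarray_parallel;

use ndarray::prelude::*;
use ndarray::Zip;
use ndarray_parallel::prelude::*;

fn main() {
    const N: usize = 10;
    let vx = Array2::<f64>::ones((N, N).f());
    let vy = Array2::<f64>::ones((N, N).f());
    let mut div = Array2::<f64>::zeros((N - 1, N - 1));

    Zip::from(&mut div)
        .and(vx.slice(s![..-1, ..-1]).t())
        .and(vx.slice(s![1.., ..-1]).t())
        .and(vy.slice(s![..-1, ..-1]).t())
        .and(vy.slice(s![..-1, 1..]).t())
        .par_apply(|div, &vx1, &vx2, &vy1, &vy2| {
            *div = -(vx1 - vx2 + vy1 - vy2);
        });
    println!("{:8.3}", div);
}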

@bluss (Member) commented Mar 31, 2017

@mokasin We're going to add a window producer/iterable, but I actually think something array-wide is much faster than iterating the array in windows around each element.

Something array wide would indeed be something like:

b += a[2..] - a[..-2]   // pseudocode because Rust slicing is not this nice
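
In actual ndarray code that could look roughly like the following sketch (the concrete input values are only an example):

#[macro_use(s)]
extern crate ndarray;

use ndarray::prelude::*;

fn main() {
    let a = Array1::<f64>::linspace(0., 9., 10);
    let mut b = Array1::<f64>::zeros(8);

    // b += a[2..] - a[..-2]: subtracting the two slices allocates a temporary
    // array, which is then added element-wise into b.
    let diff = &a.slice(s![2..]) - &a.slice(s![..-2]);
    b += &diff;

    println!("{:?}", b);
}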

@msiglreith (Contributor, Author)

Great work!
Tried it in my code and it seems to work fine (no visual difference in the resulting flow).

ndarray::Zip::from(div)
    .and(vel.x.slice(s![.., 1..]))
    .and(vel.x.slice(s![.., ..-1]))
    .and(vel.y.slice(s![1.., ..]))
    .and(vel.y.slice(s![..-1, ..]))
    .apply(|div, &vx1, &vx2, &vy1, &vy2| {
        *div = -(vx1 - vx2 + vy1 - vy2);
    });

I'm using a staggered grid, so boundary conditions are not really an issue, except that I needed to adjust the dimensions of the grids (vel.x and vel.y are larger than div). Therefore I added the additional slice operations .slice(s![.., ..-1]) and .slice(s![..-1, ..]).

A drawback of this approach is slightly reduced readability, imo. I had to read up on the documentation first to grasp the whole concept, and even so, direct indexing is easier to read.
Currently most of my code is based on tuple-based indexing; some computationally heavy parts are optimized by direct array index calculation. The Zip-based approach might serve as an intermediate optimization step.

Divergence calculation isn't a performance-heavy step in my fluid simulation. Even so, is there any overhead to this Zip-based approach compared to direct array indexing?
Direct indexing sample:

let (h, w) = div.dim();

let div = div.as_slice_mut().unwrap();
let vx = vel.x.as_slice().unwrap();
let vy = vel.y.as_slice().unwrap();

let mut idx = 0;

for y in 0..h {
    for x in 0..w {
        div[idx] = -(vx[idx + y + 1] - vx[idx + y] + vy[idx + w] - vy[idx]);
        idx += 1;
    }
}

Thanks again for your work on this issue, I appreciate it!

@bluss (Member) commented Mar 31, 2017

Zip was created so that we have a safe Rust way to do this efficiently. It looks at the memory layouts of the inputs and tries to do its best. (If all inputs are C/F contiguous it uses a linear index through all the arrays' memory; otherwise it “unrolls” the innermost dimension and uses each array's stride × an index to loop over it.)

That means that if you take care with the memory layouts (see the previous message about that), Zip should always win. There are no bounds checks, unlike in your slice-indexing example. I'm pretty sure the bounds checks don't optimize out in your example (but I would be happy to be positively surprised).

Note that this:

Zip::from(&mut div)
    .and(vx.slice(s![..-1, ..-1]).t())
    .and(vx.slice(s![1.., ..-1]).t())
    .and(vy.slice(s![..-1, ..-1]).t())
    .and(vy.slice(s![..-1, 1..]).t())
    .apply(|div, &vx1, &vx2, &vy1, &vy2| {
        *div = -(vx1 - vx2 + vy1 - vy2);
    })

is equivalent to:

div -= &vx.slice(s![..-1, ..-1]).t();
div += &vx.slice(s![1.., ..-1]).t();
div -= &vy.slice(s![..-1, ..-1]).t();
div += &vy.slice(s![..-1, 1..]).t();

But I don't have a parallel version of the latter!

@bluss (Member) commented Apr 6, 2017

Zip::indexed exists and is parallelized in master, just not in a proper (non-alpha) release yet. It enters there, rather than among the iterators. We're expanding and extending the NdProducer concept now. Our current idea is that splitting is something a producer or IntoIterator can implement, not an iterator.
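
To connect this back to the opening example, a sketch of how the indexed form can look with Zip::indexed (the array shapes are made up for illustration; with ndarray-parallel, apply becomes par_apply):

extern crate ndarray;

use ndarray::prelude::*;
use ndarray::Zip;

fn main() {
    let vx = Array2::<f64>::ones((4, 5));
    let vy = Array2::<f64>::ones((5, 4));
    let mut div = Array2::<f64>::zeros((4, 4));

    // Zip::indexed passes each element's (row, column) index to the closure,
    // so neighbouring elements of vx and vy can be looked up directly.
    Zip::indexed(&mut div).apply(|(y, x), div| {
        *div = -(vx[(y, x + 1)] - vx[(y, x)] + vy[(y + 1, x)] - vy[(y, x)]);
    });

    println!("{:8.3}", div);
}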

@bluss (Member) commented Apr 6, 2017

@msiglreith (Contributor, Author)

Amazing work! Thanks for the quick implementation, will play around with it in the next days :)

@Renmusxd

Have there been any updates on IndexedIter and IndexedIterMut getting parallelization support?

@bluss (Member) commented May 25, 2022

rayon support is through Zip; see the previous comment #279 (comment).
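
For anyone landing here today, a sketch of what that looks like with a recent ndarray (assuming the crate's "rayon" feature is enabled): Zip::indexed combined with par_for_each gives a parallel indexed loop.

use ndarray::{Array2, Zip};

fn main() {
    let a = Array2::<f64>::ones((100, 100));
    let mut out = Array2::<f64>::zeros((100, 100));

    // The closure receives the (row, column) index along with the element and
    // runs on rayon's thread pool.
    Zip::indexed(&mut out).par_for_each(|(i, j), out| {
        *out = a[(i, j)] * (i + j) as f64;
    });

    println!("{}", out[(0, 0)]);
}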
