m.t() * m #445

Closed
nilgoyette opened this issue May 1, 2018 · 12 comments

@nilgoyette
Collaborator

nilgoyette commented May 1, 2018

Is there an optimization available for multiplying a matrix by its transpose? I'm trying to optimize a program whose slowest part is m.t() * m (a somewhat big matrix in the innermost loop). From what I've read, this product always gives a symmetric matrix, which would take n(n+1)/2 operations instead of n^2. The lapack function dsyrk is supposed to handle that. I don't know if it would actually help, but I'm curious to test.

Also, is there a way to know if I'm using lapack? I didn't enable any special feature for ndarray in my Cargo.toml. A perf profile showed 29.26% in _ZN14matrixmultiply4gemm13masked_kernel, so I think I'm using it because gemm is a lapack name. But is there a simpler way to check?

EDIT: Oh, sorry, I meant BLAS everywhere in my text. I wasn't aware of the difference :)
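The symmetry claim above can be sanity-checked with a small std-only sketch (helper name invented here, not ndarray code): each entry (AᵀA)[i][j] is the dot product of columns i and j of A, so (AᵀA)[i][j] == (AᵀA)[j][i] and only the n(n+1)/2 entries on and above the diagonal are distinct.

```rust
// Toy std-only check: AᵀA is always symmetric, because (AᵀA)[i][j]
// is the dot product of columns i and j of A, and dot products commute.

fn matmul_at_a(a: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let rows = a.len();
    let n = a[0].len();
    let mut c = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            for r in 0..rows {
                c[i][j] += a[r][i] * a[r][j];
            }
        }
    }
    c
}

fn main() {
    // A is 2x3, so AᵀA is 3x3.
    let a = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]];
    let c = matmul_at_a(&a);
    for i in 0..3 {
        for j in 0..3 {
            assert_eq!(c[i][j], c[j][i]); // symmetric
        }
    }
    // Only n(n+1)/2 = 6 of the 9 entries are distinct.
    println!("{:?}", c);
}
```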

@vbarrielle
Contributor

That symbol is probably from the pure Rust implementation of matrix multiplication (the matrixmultiply crate). ndarray can be configured to use a custom BLAS through the blas feature, but I don't know if there's any LAPACK integration available...

@rcarson3
Contributor

rcarson3 commented May 3, 2018

Going off @vbarrielle's earlier comment: if you want to use an underlying BLAS, you'd want to follow what's listed in the README, which I've pasted below:

How to enable blas integration. Depend on blas-src directly to pick a blas
provider. Depend on the same blas-src version as ndarray does, for the
selection to work. A proposed configuration using system openblas is shown
below. Note that only end-user projects (not libraries) should select
provider:

[dependencies]
ndarray = { version = "0.11.0", features = ["blas"] }
blas-src = { version = "0.1.2", default-features = false, features = ["openblas"] }
openblas-src = { version = "0.5.6", default-features = false, features = ["cblas", "system"] }

So, if you want to try out the dsyrk method, I'd imagine you could do something similar to what's done in ndarray for gemm, where it links to the underlying dgemm and sgemm BLAS calls. I mention this because ndarray does not appear to offer an optimized routine for this specific case; it only links to the gemm functions.
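For reference, here is a std-only sketch of what the dsyrk routine computes in the trans='T' case: C := alpha · AᵀA + beta · C, touching only one triangle of C since the result is known to be symmetric. This is an illustration of the semantics, not ndarray's or BLAS's actual code.

```rust
// Std-only illustration of dsyrk semantics (trans='T', uplo='U'):
// C := alpha * AᵀA + beta * C, writing only the upper triangle of C.

fn syrk_upper(alpha: f64, a: &[Vec<f64>], beta: f64, c: &mut [Vec<f64>]) {
    let k = a.len();    // rows of A
    let n = a[0].len(); // columns of A; C is n x n
    for i in 0..n {
        for j in i..n { // upper triangle only: j >= i
            let mut dot = 0.0;
            for r in 0..k {
                dot += a[r][i] * a[r][j];
            }
            c[i][j] = alpha * dot + beta * c[i][j];
        }
    }
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut c = vec![vec![0.0; 2]; 2];
    syrk_upper(1.0, &a, 0.0, &mut c);
    // Upper triangle of AᵀA; the caller mirrors it if a full matrix is needed.
    println!("{:?}", c);
}
```

A real binding would of course call the Fortran/CBLAS routine instead of this loop; the point is that only n(n+1)/2 entries are computed.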

@nilgoyette
Collaborator Author

@vbarrielle and @rcarson3, thank you for your answers. I'll test what you suggest and contribute if the results are interesting. Maybe having a dot_transpose method would be a plus for ndarray?

@jturner314
Member

I think it would be useful to incorporate the *syrk functions into ndarray for 2-D arrays. I'd probably call the method dot_with_t.

On second thought, maybe it would make more sense to create a separate crate for providing ndarray‑based wrappers for the less common BLAS functions (similar to how ndarray-linalg and linxal provide wrappers for LAPACK functions)?

@rcarson3
Contributor

rcarson3 commented May 3, 2018

I think the separate-crate idea makes a lot of sense. I'm sure there are many BLAS calls that aren't in ndarray that people would like to use, so nice wrappers around them would be valuable. It would also allow for better, more descriptive names for a variety of BLAS calls.

Edit: I'd also like to mention that I'd be up for helping work on this new crate. However, my current work load will probably limit my ability to contribute to it as much as I'd like for the next few months.

@nilgoyette
Collaborator Author

I tested and the BLAS results are interesting! I made 4 simple benchmarks with 100x50 identity matrices. Results are in ns. As you can see, the speedup is around 2.8x .

                f32   f64
standard dot    24k   41k
transpose_dot    9k   14k

I also tested a naive Rust implementation, but it's a disaster! I wasn't aware that dot calls the matrixmultiply crate. I learned that this crate is much more optimized than anything I could write in a day :)

I'm not finished with this task because I don't know if you want a new crate, and I'm unsure how to manage the A'A vs. AA' option (an enum, or two different methods?). I can push what I did even if we don't have a consensus yet, if there's any interest.
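To make the enum-vs-methods question concrete, here is a hypothetical API sketch (all names invented here, not ndarray's) with a single entry point that dispatches on which product is wanted, backed by a naive std-only multiply:

```rust
// Hypothetical API sketch for the "AᵀA or AAᵀ" question: one entry
// point taking an enum. Names are invented for illustration only.

#[derive(Clone, Copy)]
enum GramSide {
    /// Compute Aᵀ * A (n x n for an m x n input).
    TransposeTimesSelf,
    /// Compute A * Aᵀ (m x m for an m x n input).
    SelfTimesTranspose,
}

fn gram(a: &[Vec<f64>], side: GramSide) -> Vec<Vec<f64>> {
    let (m, n) = (a.len(), a[0].len());
    match side {
        GramSide::TransposeTimesSelf => {
            let mut c = vec![vec![0.0; n]; n];
            for i in 0..n {
                for j in 0..n {
                    for r in 0..m {
                        c[i][j] += a[r][i] * a[r][j];
                    }
                }
            }
            c
        }
        GramSide::SelfTimesTranspose => {
            let mut c = vec![vec![0.0; m]; m];
            for i in 0..m {
                for j in 0..m {
                    for k in 0..n {
                        c[i][j] += a[i][k] * a[j][k];
                    }
                }
            }
            c
        }
    }
}

fn main() {
    let a = vec![vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]]; // 2x3
    assert_eq!(gram(&a, GramSide::TransposeTimesSelf).len(), 3);  // 3x3
    assert_eq!(gram(&a, GramSide::SelfTimesTranspose).len(), 2); // 2x2
}
```

Two separate methods would work just as well; the enum merely keeps the shared dispatch to syrk in one place.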

@termoshtt
Member

It looks great :)
ndarray-linalg could include such BLAS-based functions, since it already depends on BLAS implicitly.

FYI, A'A can be seen as the square of the "absolute value" of a matrix, |A|^2:
https://math.stackexchange.com/questions/2114687/absolute-value-of-a-matrix

@nilgoyette
Collaborator Author

@termoshtt Yes, maybe this stuff should be in your crate. Or not. It's not for me to decide! The owner(s) of this crate should state their opinion.

Also, I know this is not the right place for a math course, but could you please explain a bit more? As a programmer, "absolute value of a matrix" sounds like calling abs() on every element, and a toy test quickly shows that A'A != |A|^2 under that reading, except maybe for zero or identity matrices.
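The two readings can be told apart with a small std-only example. In the linked sense, |A| is the matrix square root (AᵀA)^(1/2) (a matrix function), so |A|^2 = AᵀA holds by definition; under the elementwise-abs reading it generally fails, e.g. for A = [[1,1],[0,1]], where abs(A) == A elementwise:

```rust
// Toy std-only check: with A = [[1,1],[0,1]], elementwise abs(A) == A,
// so the elementwise reading of |A|^2 is A*A = [[1,2],[0,1]], which
// differs from AᵀA = [[1,1],[1,2]]. In the linked thread's sense,
// |A| = (AᵀA)^(1/2) is a matrix square root, making |A|^2 = AᵀA by definition.

fn mul(x: &[[f64; 2]; 2], y: &[[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut c = [[0.0; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                c[i][j] += x[i][k] * y[k][j];
            }
        }
    }
    c
}

fn main() {
    let a = [[1.0, 1.0], [0.0, 1.0]];
    let at = [[1.0, 0.0], [1.0, 1.0]]; // transpose of a
    let at_a = mul(&at, &a);  // AᵀA = [[1,1],[1,2]]
    let abs_sq = mul(&a, &a); // elementwise |A| == A here, then squared
    assert_ne!(at_a, abs_sq); // the two readings disagree
    println!("AᵀA = {:?}, abs(A)^2 = {:?}", at_a, abs_sq);
}
```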

@jturner314
Member

I'd prefer to place less-commonly-used BLAS wrappers such as this function in a crate outside of ndarray. If enough BLAS wrappers get implemented in the future, it would make sense to split off a separate ndarray-blas crate, but until then, ndarray-linalg sounds like a good place to me.

@nilgoyette
Collaborator Author

Closing this issue. As explained, such a feature would be in another crate of the ndarray ecosystem.

@benkay86
Contributor

benkay86 commented Nov 9, 2023

@nilgoyette, I'm running into the same issue: I'm doing a lot of m.t() * m in a tight loop and would like to take advantage of dsyrk to make it faster. Would you mind sharing the code you used in your benchmark? Have you thought about submitting a PR to ndarray-linalg?

@nilgoyette
Collaborator Author

Sorry @benkay86, I changed computers twice at work in the last 6 years and this particular folder has been lost.

IIRC, it wasn't terribly complex to code. I think I simply copied the "dot" code that calls gemm, replacing it with a call to syrk.
