This document is written at v0.3.0 version.
Though I believe this crate provides many functionalities that interests audiences in scientific computation, there are also some points that this crate is lack of, or is not going to support with by design.
For the features that will not support with,
- Supports to other types of matrices. Currently crates such as
ndarray
,nalgebra
,matrix
,faer-rs
,rulinalg
,rest_tensors
represent typical matrix implementations in rust. Though some crates are extremely fast (comparable to MKL or OpenBLAS, especiallyfaer-rs
) in linalgs, it seems thatndarray
could better support features in high-dimension tensors and sub-matrices, as well as advanced slicing/view. This is also related to concept of "leading dimension" in BLAS. Sondarray
is choosen to represent matrix in this crate. Matrices defined in other crates should be able to easily casted tondarray
, without explicit memory copy (by rust's moving or by tensor view of slice from raw-data), and thus not providing BLAS to other types of matrices. - Other targets (GPU). This may well require a new data structure (such as numpy v.s. torch/jax). Though machine learning frameworks (such as
candle
,burn
, etc.) seem to be promising, an API-stable tensor structure that accepts various targets, with advanced slicing/stride support has probably yet existed in rust. - Arbitary data types. Currently, this crate supports f32/f64/c32/c64, which should be supported by legacy BLAS standard. However, this crate will not implement something like int8/rug. To address this issue, a BLAS reference implementation to any types is required, and is out of scope for this crate, which is only a high-level wrapper to BLAS (or BLAS-like) functions.
- Fn instead of struct. A common sense for using BLAS functions with matrices, is function with optional parameters. However, this is not possible in rust, syntactically. So we choose to use struct (with
derive_build
) to pass optional parameters. In this way, there is at least one additional drawback: no IDE-time/compile-time check to non-optional parameters, so errors may occur in runtime; this require programmers to use this crate with caution. - Lapack wrapper. This is surely important, but will probably be implemented in a new crate.
For the features that can be added, but currently haven't been added (probably I'm not interested in for the moment I'm writing this crate) and may be implemented in a later time, or may be implemented after someone giving feature requests in issues/PRs.
- BLAS1 functions. I think that in many cases, functionality of BLAS1 can be simply covered by iterators, or features in matrix libraries such as
ndarray
. Efficiency of BLAS1 can be achieved by compiler optimization (especially for serial/single-thread usage), using BLAS may not significantly improve efficiency if your code is mostly computation-bounded instead of memory-bounded. - Other kinds of floats. With development of machine learning nowadays, demands of low-precision BLAS is increasing; MKL and OpenBLAS has already implemented some
BF16
features. - BLAS extensions. There are some important BLAS extensions, such as
omatcopy
,imatcopy
,gemmt
,gemm3m
, that has already been implemented in both OpenBLAS, MKL and BLIS. Currently,gemmt
has been implemented. Others are on-going work. - Documents. I hope that this crate is clear and simple to use by itself, especially to programmers from C/C++/F77. This crate want to be
scipy.linalg.blas
replacement in rust. So documentation may not be of that important (https://netlib.org/lapack/explore-html/ is farely well documented), but probably it's still required to newcommers. - Tests. It works, but it is messy. Boundary conditions have not been fully checked.
For example of GEMM:
-
trait GEMMNum: BLASFloat
This trait directly wraps FFI function
?gemm_
, and implemented directly to float types (f32
,f64
,Complex<f32>
,Complex<f64>
if appropriate). With this trait, one can call FFI functiondgemm_
byf64::gemm
. -
struct GEMM_Driver
(with trait boundBLASDriver
)This struct is a (not so safe) wrapper to FFI wrapper
GEMMFunc
. This struct takes rust types, and return result by calling functionrun_blas
function, which calls FFI function.On calling function
run_blas
, all fields inGEMM_Driver
will be consumed (dropped), and only result matrixwill be returned. -
struct GEMM_
(with trait boundBLASBuilder_
)This struct acts as inner driver, and will be wrapped by
GEMM_Builder
by cratederive_builder
.Trait function
driver
(inner driver, trait function) ofBLASBuilder_
is not the same to struct functiondriver
(outer driver, automatically derived) ofGEMM_Builder
.This struct is not designed to be directly called. When calling
driver
function inGEMM_
,- All input (except output) should be column-major;
- All fields should be filled;
-
layout
must beSome(BLASColMajor)
; - Quick return if zero dimension input;
- Output
c
can be arbitary major, or simplyNone
.
Inner driver will perform
- Obtaining dimension and leading dimension that will be needed in FFI functions,
- Checking validability of parameters (all in column-major),
- Allocating output buffer if output
c
is not column-major (to beArrayOut::ToBeCloned
), or not defined (to beArrayOut::Owned
). - Returning FFI
GEMM_Driver
for future computation.
-
struct GEMM_Builder
(with trait boundBLASBuilder
)This struct is builder, automatically derived by crate
derive_builder
, which is designed to be accessible by API user.This struct accepts input with arbitary stride, though this is not recommanded.
Trait function
run
will perform the following works:- Check layout (will be further explained in Layout Convention),
- Perform trans / side / uplo flag flip, and matrix transposition to proper layout;
- Generate
GEMM_
struct, and perform FFI computation; - If layout is row-major, reverse axis of output matrix from FFI (FFI only accepts column-major matrix).
For this crate, row-major and col-major is not the same meaning to C-contiguous and F-contiguous (flags of numpy).
Contiguous means that all data in memory is contiguous. Difference of C-contiguous and F-contiguous in that how the matrix is presented in memory.
C-contiguous F-contiguous Non-contiguous Sequential
(by row) (by col) (by row) (onedim)
+++++ +++-- ----- +++++
+++++ +++-- -+++-
+++++ +++-- -+++-
----- +++-- -+++-
----- +++-- -----
shape [3, 5] [5, 3] [3, 3]
strides [5, 1] [1, 5] [5, 1]
Major does not require all data in memory is contiguous. However, it require data in one dimension to be contiguous.
row-major col-major row/col-major row/col-major custom-layout
(by row) (by col) (by row) (by col) (by row)
++++- ----- ----- ----- +-+-+
----- +-+-+ -+++- -+++- -----
++++- +-+-+ -+++- -+++- +-+-+
----- +-+-+ -+++- -+++- -----
++++- +-+-+ ----- ----- +-+-+
shape [ 3, 4] [4, 3] [3, 3] [3, 3] [ 3, 3]
strides [10, 1] [1, 10] [5, 1] [1, 5] [10, 2]
In this crate, we check row/col-major if stride of second/first dimension is one.
For all BLAS3 functions, specifying row-major or col-major to keyword layout
will not change result; however, efficiency may vary.
Layout is defined by this way:
- Use layout provided by user; otherwise (user does not explicitly provide layout)
- If output matrix provided by user is either row-major or col-major, then specify layout by output matrix layout; otherwise (output matrix is custom-layout, or no output matrix provided)
- If all input is col-major, then specify layout as col-major; otherwise, specify layout as row-major.
For packed and banded operations, row-major and col-major could incur different results. For more information, we refer to BLAST document of C Interface to the Legacy BLAS.
In other cases, result is unrelated to layout, as the same case in BLAS3.