Using non-default streams in CUDA #2

#### Description of the problem

Currently, in #6, all operations are issued to the default stream. However, we could use non-default streams so that kernels belonging to independent operations can execute in parallel.
One example is filling `n` vectors in parallel with `fill_vector_kernel` launched in `n` separate streams. Another is filling an `n` x `m` matrix with `n` or `m` kernels launched in separate streams.
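
To make the `n`-vectors case concrete, here is a minimal sketch rather than a proposed API. The signature and body of `fill_vector_kernel` are assumptions (the kernel in this repository may look different), and the `CUDA_CHECK` macro, the vector count, and the sizes are purely illustrative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and bail out of main on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Assumed stand-in for the library's fill_vector_kernel.
__global__ void fill_vector_kernel(float *data, size_t len, float value) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) data[i] = value;
}

int main() {
    const int n = 4;              // number of vectors, one stream each
    const size_t len = 1 << 20;   // elements per vector
    const int threads = 256;
    const int blocks = (int)((len + threads - 1) / threads);

    std::vector<cudaStream_t> streams(n);
    std::vector<float *> vectors(n, nullptr);
    for (int i = 0; i < n; ++i) {
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
        CUDA_CHECK(cudaMalloc(&vectors[i], len * sizeof(float)));
    }

    // Issue each fill to its own stream; the launches do not wait on each other.
    for (int i = 0; i < n; ++i) {
        fill_vector_kernel<<<blocks, threads, 0, streams[i]>>>(vectors[i], len, (float)i);
        CUDA_CHECK(cudaGetLastError());
    }
    CUDA_CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    // Spot-check the first and last element of every vector.
    for (int i = 0; i < n; ++i) {
        float first = -1.0f, last = -1.0f;
        CUDA_CHECK(cudaMemcpy(&first, vectors[i], sizeof(float), cudaMemcpyDeviceToHost));
        CUDA_CHECK(cudaMemcpy(&last, vectors[i] + len - 1, sizeof(float), cudaMemcpyDeviceToHost));
        if (first != (float)i || last != (float)i) {
            fprintf(stderr, "vector %d was not filled correctly\n", i);
            return 1;
        }
    }

    for (int i = 0; i < n; ++i) {
        CUDA_CHECK(cudaFree(vectors[i]));
        CUDA_CHECK(cudaStreamDestroy(streams[i]));
    }
    printf("all %d vectors filled\n", n);
    return 0;
}
```

Separate streams only permit concurrent execution; they do not guarantee it. If a single fill already occupies the whole GPU, the kernels will mostly run back to back, so the benefit is likely to show up mainly for many small vectors.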
Before moving on to the implementation, we can discuss the API for the above use case.
Please comment below if you have any ideas. I will come up with a design soon.
Another advantage of non-default streams, as described in https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/, is that data transfers can overlap with kernel execution. However, IMO this isn't really useful for this library: a user may want to copy only a small `Vector` back to the host, and in that case the time spent creating streams isn't worth it.
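
For reference, the overlap pattern from that post looks roughly like the sketch below. It is an illustration under assumptions, not library code: `scale_kernel`, the `CUDA_CHECK` macro, the chunk size, and the stream count are all hypothetical. The host buffer is pinned with `cudaMallocHost`, which asynchronous copies need in order to overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the error and bail out of main on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiply every element by a constant factor.
__global__ void scale_kernel(float *data, size_t len, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len) data[i] *= factor;
}

int main() {
    const int num_streams = 4;
    const size_t chunk = 1 << 20;               // elements per stream
    const size_t total = chunk * num_streams;
    const size_t chunk_bytes = chunk * sizeof(float);
    const int threads = 256;
    const int blocks = (int)((chunk + threads - 1) / threads);

    float *host = nullptr, *dev = nullptr;
    CUDA_CHECK(cudaMallocHost(&host, total * sizeof(float)));  // pinned host memory
    CUDA_CHECK(cudaMalloc(&dev, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) host[i] = 1.0f;

    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; ++s) CUDA_CHECK(cudaStreamCreate(&streams[s]));

    // Each stream copies its chunk in, processes it, and copies it back.
    // Work in different streams may overlap: one chunk's copy with another's kernel.
    for (int s = 0; s < num_streams; ++s) {
        size_t offset = (size_t)s * chunk;
        CUDA_CHECK(cudaMemcpyAsync(dev + offset, host + offset, chunk_bytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(dev + offset, chunk, 2.0f);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(host + offset, dev + offset, chunk_bytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    // Every element started at 1.0f and was scaled by 2.0f.
    if (host[0] != 2.0f || host[total - 1] != 2.0f) {
        fprintf(stderr, "unexpected result\n");
        return 1;
    }

    for (int s = 0; s < num_streams; ++s) CUDA_CHECK(cudaStreamDestroy(streams[s]));
    CUDA_CHECK(cudaFree(dev));
    CUDA_CHECK(cudaFreeHost(host));
    printf("done\n");
    return 0;
}
```

The pattern only pays off when the transfers are large enough to split into chunks, which is consistent with the concern above about small `Vector` copies.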
#### Example of the problem
