This repository was archived by the owner on May 21, 2022. It is now read-only.

More ways to handle non-divisible batch sizes #14

@oxinabox

Description

After #9,
I was thinking about the ways one can handle non-divisible batch sizes.
I think MLDataPattern could do with one or two more.
I will enumerate them here for consideration.

Truncate (current default)

Cut batches of the given size from the full set.
Discard any remainder.
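A minimal sketch of the truncate strategy (Python for illustration; the function name is my own, not part of MLDataPattern):

```python
def truncate_batches(n_obs: int, size: int) -> list:
    """Cut full batches of `size` from `n_obs` observations; drop the remainder."""
    return [size] * (n_obs // size)

# 10 observations at size 3 give three full batches; one observation is discarded.
```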

Round size down (current upto/max size)

Decrease the size until it reaches a divisor.
Worst case terminates when size == 1, i.e. online learning.

Useful for capping the maximum amount of memory used per batch.

This is equivalent to rounding the count up.
It sets a minimum number of batches.
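Sketched the same way (again a hypothetical helper, not the package's API):

```python
def round_size_down(n_obs: int, size: int) -> int:
    """Decrease `size` until it evenly divides `n_obs`; worst case returns 1."""
    while n_obs % size != 0:
        size -= 1
    return size
```

For example, 10 observations at a requested size of 4 settle on batches of 2.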

Round size up

Increase the size until it reaches a divisor.
Worst case terminates when size == n_obs, i.e. full batch.

Useful for ensuring a minimum number of observations per batch.

This is equivalent to rounding the count down.
It sets a maximum number of batches.
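The mirror-image sketch of the above (hypothetical helper name):

```python
def round_size_up(n_obs: int, size: int) -> int:
    """Increase `size` until it evenly divides `n_obs`; worst case returns n_obs."""
    while n_obs % size != 0:
        size += 1
    return size
```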

Round size nearest

Alternately consider increasing and decreasing the size,
until it reaches a divisor.

Worst case terminates when size reaches the nearer of 1 or n_obs.

It gives evenly sized batches as close as possible to the size given by the user.
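A sketch of the alternating search (my own helper; when two divisors are equally distant, this version arbitrarily prefers the smaller one):

```python
def round_size_nearest(n_obs: int, size: int) -> int:
    """Find the divisor of `n_obs` nearest to `size` (ties broken downward)."""
    for delta in range(n_obs):
        down, up = size - delta, size + delta
        if down >= 1 and n_obs % down == 0:
            return down
        if up <= n_obs and n_obs % up == 0:
            return up
    return n_obs
```

For example, 10 observations at a requested size of 4 give batches of 5 (distance 1 up beats distance 2 down).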

Remainder Batch

Take batches of full size,
then add an extra undersized batch at the end,
containing the remainder.
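As a sketch (again an illustrative helper, not package code):

```python
def remainder_batches(n_obs: int, size: int) -> list:
    """Full-size batches, plus one undersized batch holding any remainder."""
    sizes = [size] * (n_obs // size)
    if n_obs % size != 0:
        sizes.append(n_obs % size)
    return sizes
```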

Uneven Batch Sizes

Increase the size of some of the batches to absorb the remainder of the division.

Assuming one's algorithm can handle batches of varying size,
this is probably the ideal.

To sketch the calculations out:

function uneven_sizes(data, size)
    n_observations = nobs(data)
    n_batches = n_observations ÷ size
    remainder = n_observations % size
    @assert remainder < size
    # start with `n_batches` batches of the requested size
    batch_sizes = fill(size, n_batches)
    # spread the remainder as evenly as possible over the batches
    everywhere_extra = remainder ÷ n_batches
    extra_extra = remainder % n_batches
    @assert extra_extra < n_batches
    batch_sizes .+= everywhere_extra
    # the first `extra_extra` batches take one extra observation each
    batch_sizes[1:extra_extra] .+= 1
    batch_sizes
end

It can also be done by directly calculating index positions,
and even lazily, using CatViews of UnitRanges.
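The index-position version is just a cumulative sum over the batch sizes. A Python sketch of the idea (plain `range`s standing in for the UnitRanges a CatView would concatenate):

```python
from itertools import accumulate

def batch_ranges(sizes: list) -> list:
    """Turn batch sizes into consecutive half-open index ranges (0-based)."""
    bounds = [0] + list(accumulate(sizes))
    return [range(a, b) for a, b in zip(bounds, bounds[1:])]
```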


So that is six options, of which two are already implemented.
I'm not sure that all are required, though.

There are another couple more.
E.g.

  • Uneven Batches, down: except at the start you increase the number of batches by 1, then shrink the batch sizes.
  • Extra Batch, down: except instead of having an undersized batch at the end, you make the final batch oversized by appending the remainder.
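The second variant is a one-line change from the Remainder Batch sketch above (hypothetical helper; here the degenerate case of fewer observations than the batch size yields a single short batch):

```python
def oversized_final_batch(n_obs: int, size: int) -> list:
    """Full-size batches, with the remainder folded into the final batch."""
    n_batches = max(n_obs // size, 1)
    sizes = [size] * n_batches
    sizes[-1] += n_obs - size * n_batches
    return sizes
```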

I'm pretty sure these are all sanely implementable inside the datasubset paradigm, which is a positive sign for the overall architecture.
