Description
Hi, thank you for developing and maintaining this awesome library and ecosystem!
I'm not entirely sure but could it be that the documentation for the AdamW
optimizer is a bit misleading? If I understand correctly, then its definition of
AdamW(η = 0.001, β = (0.9, 0.999), decay = 0) = Optimiser(Adam(η, β), WeightDecay(decay))
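means that it performs this update (where $\hat{m}_t$ and $\hat{v}_t$ are Adam's bias-corrected first and second moment estimates):

$$\theta_t \leftarrow \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \mathrm{decay} \cdot \theta_{t-1}$$

However, the paper on AdamW (which is linked to by the docs) parametrizes this differently as:

$$\theta_t \leftarrow \theta_{t-1} - \eta_t \left( \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

I.e. Flux's eta corresponds to the paper's $\eta_t \alpha$, and Flux's decay corresponds to the paper's $\eta_t \lambda$.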
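
To double-check the correspondence, here is a minimal sketch in plain Julia (no Flux involved) that takes one scalar step under each parametrization. The names `flux_step`, `paper_step` and `adam_dir` are made up for illustration, and `adam_dir` is an arbitrary stand-in for the Adam direction m̂ / (√v̂ + ε). It assumes the composed optimiser first scales the Adam direction by η and then adds decay * θ, as I understand the definition above.

```julia
# One scalar parameter step under each parametrization.
# `adam_dir` is a hypothetical stand-in for m̂ / (√v̂ + ε); its value is arbitrary.

# Flux-style composition (assumed): Adam contributes η * adam_dir,
# then WeightDecay adds decay * θ, and the sum is subtracted from θ.
flux_step(θ, adam_dir; η, decay) = θ - η * adam_dir - decay * θ

# Paper-style AdamW: the schedule multiplier η_t scales both
# the Adam term (with step size α) and the decay term (with factor λ).
paper_step(θ, adam_dir; η_t, α, λ) = θ - η_t * (α * adam_dir + λ * θ)

θ, adam_dir = 0.5, 0.2
η_t, α, λ = 0.1, 0.001, 0.01

θ_paper = paper_step(θ, adam_dir; η_t = η_t, α = α, λ = λ)

# Matching hyperparameters: η = η_t * α and decay = η_t * λ.
θ_matched = flux_step(θ, adam_dir; η = η_t * α, decay = η_t * λ)

# Passing the paper's α and λ through unchanged gives a different step.
θ_naive = flux_step(θ, adam_dir; η = α, decay = λ)

@assert θ_paper ≈ θ_matched
@assert !(θ_paper ≈ θ_naive)
```

So anyone copying α and λ straight from the paper into eta and decay would get a different effective weight decay whenever η_t is not 1.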
This is probably super unimportant (in that case, sorry for the noise), but since I noticed it while bug hunting in an implementation of mine (which uses AdamW), I thought I'd report it.