This repo contains a PyTorch implementation of learning rate dropout from the paper "Learning Rate Dropout" by Lin et al.
To train a ResNet34 model on CIFAR-10 with the paper's hyperparameters, run

`python main.py --lr=.1 --lr_dropout_rate=0.5`
The training code is based on the pytorch-cifar repo and uses track-ml for logging metrics. Note that this implementation applies learning rate dropout only; it does not also add standard dropout.
The vanilla method is pytorch-cifar's default: SGD with `lr=.1`, `momentum=.9`, `weight_decay=5e-4`, and `batch_size=128`. The SGD-LRD method additionally uses `lr_dropout_rate=0.5`. I ran four trials for each method.
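
For concreteness, a minimal sketch of the baseline setup under those hyperparameters might look like the following. Note that `torchvision.models.resnet34` is only a stand-in here: pytorch-cifar defines its own CIFAR-sized ResNet34, and the data transforms are simplified.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative stand-in; pytorch-cifar uses its own CIFAR-sized ResNet34.
model = torchvision.models.resnet34(num_classes=10)

# The vanilla hyperparameters from above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor()),
    batch_size=128, shuffle=True, num_workers=2)
```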
LRD appears to help early in training but does not provide a major boost once the learning rate schedule kicks in. Here are the final test accuracies:
| Method | This repo | Paper |
|---|---|---|
| Vanilla | 95.45% | 95.30% |
| SGD-LRD | 94.43% | 95.54% |
Shortly after this repo was published, the authors released an official repo for their paper here. The only differences I could find between the implementations are listed below (a sketch of the masked update step follows the list):
- The official code uses `torch.bernoulli` for the mask, while I use `(torch.rand_like(...) < lr_dropout_rate).type(d_p.dtype)`.
- I use in-place elementwise multiplication (`.mul_`), while they use `*`.
- They clone `buf` before adding it to the parameters.
- They multiply the LR and mask before adding the update to the parameters, while I wait until the end and do `p.data.add_(-group["lr"], d_p)`.
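
For reference, here is a minimal sketch of the masked update step both implementations compute, assuming `lr_dropout_rate` is the per-coordinate keep probability (as the `rand_like(...) < lr_dropout_rate` comparison implies). The function name `sgd_lrd_step` and the loop structure are illustrative and match neither repo exactly.

```python
import torch

def sgd_lrd_step(params, buffers, lr=0.1, momentum=0.9,
                 weight_decay=5e-4, lr_dropout_rate=0.5):
    """One illustrative SGD step with learning rate dropout: each
    coordinate's update survives with probability lr_dropout_rate;
    the remaining coordinates are frozen for this step only."""
    for p in params:
        if p.grad is None:
            continue
        d_p = p.grad.data.add(p.data, alpha=weight_decay)  # weight decay
        buf = buffers.setdefault(p, torch.zeros_like(p.data))
        buf.mul_(momentum).add_(d_p)  # momentum buffer (updated for all coords)
        # Equivalent mask constructions (in distribution):
        #   torch.bernoulli(torch.full_like(p.data, lr_dropout_rate))
        #   (torch.rand_like(p.data) < lr_dropout_rate).type(d_p.dtype)
        mask = torch.bernoulli(torch.full_like(p.data, lr_dropout_rate))
        p.data.add_(mask * buf, alpha=-lr)  # only surviving coords move
```

Under this reading, `lr_dropout_rate=0.5` means roughly half the coordinates move on any given step, while the momentum buffers are still updated for every coordinate.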
It's unclear why these small differences would lead to such a large gap in performance between the implementations.