--resume issues #292

Closed
Laughing-q opened this issue Jul 4, 2020 · 28 comments
Labels: bug (Something isn't working)

@Laughing-q (Member)

When I resume my training, the learning rate is reset to 0.01 at the first epoch.
It seems that initializing the scheduler resets the learning rate, so I printed the optimizer both after loading it from my last.pt and after initializing the scheduler.
Code:
optimizer = optim.Adam(pg0, lr=hyp['lr0']) if opt.adam else \
    optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)
optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay
optimizer.add_param_group({'params': pg2})  # add pg2 (biases)
...

# load optimizer
if ckpt['optimizer'] is not None:
    optimizer.load_state_dict(ckpt['optimizer'])
    print('before:', optimizer)
    best_fitness = ckpt['best_fitness']
...

lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.9 + 0.1  # cosine
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
scheduler.last_epoch = start_epoch - 1  # do not move
print('after scheduler :', optimizer)

result:
before:
SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0

Parameter Group 1
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0.0005

Parameter Group 2
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0
)
after scheduler :
SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.01
lr: 0.01
momentum: 0.937
nesterov: True
weight_decay: 0

Parameter Group 1
dampening: 0
initial_lr: 0.01
lr: 0.01
momentum: 0.937
nesterov: True
weight_decay: 0.0005

Parameter Group 2
dampening: 0
initial_lr: 0.01
lr: 0.01
momentum: 0.937
nesterov: True
weight_decay: 0
)

Epoch gpu_mem GIoU obj cls total targets img_size
5/299 4.15G 0.03826 0.01146 0.02194 0.07166 15 640: 12%|█▏ | 31/267 [00:19<02:04, 1.90it/s]
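For reference, here is a minimal standalone sketch that reproduces the reset outside of train.py (the dummy model, epoch count, and helper names are made up for illustration; only the lambda matches the training code):

# Standalone repro (hypothetical setup, not from train.py): building LambdaLR
# after optimizer.load_state_dict() resets lr back to initial_lr * lf(0) = 0.01.
import math
from torch import nn, optim
from torch.optim import lr_scheduler

epochs = 300
lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.9 + 0.1  # cosine

def make_optimizer():
    model = nn.Linear(10, 1)  # dummy model standing in for the YOLO parameter groups
    return optim.SGD(model.parameters(), lr=0.01, momentum=0.937, nesterov=True)

# "previous run": optimizer + scheduler stepped for a few epochs, then checkpointed
opt_prev = make_optimizer()
sched_prev = lr_scheduler.LambdaLR(opt_prev, lr_lambda=lf)
for _ in range(5):
    opt_prev.step()
    sched_prev.step()
ckpt = {'optimizer': opt_prev.state_dict()}  # lr has decayed slightly by now

# "resume": load the optimizer first, then build the scheduler (train.py's order)
opt_resume = make_optimizer()
opt_resume.load_state_dict(ckpt['optimizer'])
print('after load:     ', opt_resume.param_groups[0]['lr'])  # ~0.009994 (restored)

scheduler = lr_scheduler.LambdaLR(opt_resume, lr_lambda=lf)
print('after scheduler:', opt_resume.param_groups[0]['lr'])  # 0.01 again (reset)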

Then I put the scheduler initialization above the optimizer loading, and it seems to be right.
Code:
optimizer = optim.Adam(pg0, lr=hyp['lr0']) if opt.adam else \
    optim.SGD(pg0, lr=hyp['lr0'], momentum=hyp['momentum'], nesterov=True)
optimizer.add_param_group({'params': pg1, 'weight_decay': hyp['weight_decay']})  # add pg1 with weight_decay
optimizer.add_param_group({'params': pg2})  # add pg2 (biases)
lf = lambda x: (((1 + math.cos(x * math.pi / epochs)) / 2) ** 1.0) * 0.9 + 0.1  # cosine
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
...

# load optimizer
if ckpt['optimizer'] is not None:
    optimizer.load_state_dict(ckpt['optimizer'])
    print(optimizer)
    best_fitness = ckpt['best_fitness']
...

scheduler.last_epoch = start_epoch - 1  # do not move
print(optimizer)

results:
before:
SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0

Parameter Group 1
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0.0005

Parameter Group 2
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0
)
after scheduler :
SGD (
Parameter Group 0
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0

Parameter Group 1
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0.0005

Parameter Group 2
dampening: 0
initial_lr: 0.01
lr: 0.009996052735444863
momentum: 0.937
nesterov: True
weight_decay: 0
)

Is this a problem? The code is in train.py; please check it.

@Laughing-q Laughing-q added the bug Something isn't working label Jul 4, 2020
github-actions bot (Contributor) commented Jul 4, 2020

Hello @Laughing-q, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open In Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

@glenn-jocher (Member)

@Laughing-q thanks for looking into this! If you believe you have a solution which --resumes better, please submit a PR with your proposed changes.

--resume is not 100% mature yet, as the EMA is saved during checkpointing but not the normal model, so there would be additional steps to take to create a seamless resume, but fixing this LR bug you seem to have found would be a huge step in the right direction!

Can you show before and after TensorBoard plots for interrupted coco128.yaml trainings, for example, using your fix?

@Laughing-q (Member, Author)


I actually found this problem when I was training my own dataset. I resumed at about epoch 200/299 and then found that mAP@0.5 dropped from 80 to 67, but I deleted those results.
So I retrained my dataset for the experiment, resuming at epoch 15/19. In order to eliminate the influence of burn-in, I set n_burn = -1.
Here are the results; the blue line is the run with my fix. Without my fix, precision decreased and the losses increased significantly at epoch 15. Although the final results on my dataset differ only a little, I think the blue line is what I expect.
[results plot]
I also plotted the LR changes. Without my fix, the LR is clearly reset at epoch 15.
[LR plot]

@glenn-jocher (Member)

@Laughing-q hey good job, that's a very good experiment! It looks good to me. Can you submit a PR with the fix please? Thanks!!

@Laughing-q (Member, Author)


Of course.

@glenn-jocher (Member)

I started training a few yolov5s variants just before the PR was merged. One of the VMs that was training terminated early (for Google Cloud maintenance, I assume), so I used the --resume command to see what it would look like, and this is the result.

This huge drop must be from the LR=0.01 change in the first epoch after resume, as you discovered. Hopefully your PR fixed this. From now on, new runs will include your fix, so if this happens again in the next few weeks I'll repeat the experiment to see the new results.

[results plot]

@glenn-jocher (Member)

@Laughing-q ok, I have some more information here. I --resumed 2 models that had quit unexpectedly. These two models should have the updated scheduler PR in place. Unfortunately I still see the same style of recovery.

One major aspect I noticed is that in both cases the train losses recover very quickly, within a few epochs, while the validation losses and performance metrics take much longer to recover. The one thing the val losses and performance metrics have in common is that they use the EMA, whereas the train losses are computed from the base model.

This makes sense: when a new run is started, or when a run is --resumed, a new EMA is created, and it uses a sliding decay constant that starts from zero and trends towards its final value of 0.9999 after a few thousand iterations. This means the resumed EMAs lose their smoothness; they basically turn back into base models and only slowly become a smoothed EMA again over many epochs.

I think there's an easy way to fix this without having to include the base model in the checkpoints: just add a flag to the EMA __init__() function that tells it how many iterations the base model has already trained for, so it knows what to set the decay constant to. Then training should align much more closely with training before --resume. The decay constant equation is on L199 here:

yolov5/utils/torch_utils.py

Lines 194 to 199 in bf6f415

def __init__(self, model, decay=0.9999, device=''):
    # Create EMA
    self.ema = deepcopy(model.module if is_parallel(model) else model)  # FP32 EMA
    self.ema.eval()
    self.updates = 0  # number of EMA updates
    self.decay = lambda x: decay * (1 - math.exp(-x / 2000))  # decay exponential ramp (to help early epochs)
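
A rough sketch of that idea (hypothetical, not the merged implementation; the `updates` argument and the simplified `update()` method are assumptions) could look like this:

# Minimal EMA sketch: seed the update counter from the checkpoint so the decay
# ramp resumes where it left off instead of restarting from zero on --resume.
import math
from copy import deepcopy

import torch


class ModelEMA:
    def __init__(self, model, decay=0.9999, updates=0):
        self.ema = deepcopy(model).eval()  # FP32 EMA copy of the model
        self.updates = updates             # restore this from the checkpoint on --resume
        self.decay = lambda x: decay * (1 - math.exp(-x / 2000))  # exponential ramp
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        # Blend the current model weights into the EMA with the ramped decay.
        with torch.no_grad():
            self.updates += 1
            d = self.decay(self.updates)
            msd = model.state_dict()
            for k, v in self.ema.state_dict().items():
                if v.dtype.is_floating_point:
                    v *= d
                    v += (1.0 - d) * msd[k].detach()

On --resume the counter could then be seeded with something like ModelEMA(model, updates=start_epoch * nb), where nb is the number of batches per epoch (again an assumption about how the count would be restored).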

[results plot]

@Laughing-q (Member, Author)

@glenn-jocher Actually, my friend had this problem a few days ago, so I opened #310. Then I did some experiments on it: I saved both the base model and the EMA, and when I resumed I loaded both (and also updated self.updates), but the results seemed to differ little from loading the EMA only. From your results, I think my training was perhaps too short (about 25-35 epochs) to see the huge drop. I'm glad you found the problem, so is it fixed now?

@glenn-jocher (Member)

@Laughing-q no, the problem is not fixed, but I think smart init of the EMA as described above would help a lot.

And yes, you are right: you will only see this later in training, as the EMA decay function takes about 10-20k iterations to get close to its steady-state value of 0.9999. On COCO this might be 5-10 epochs or more, so for smaller datasets (like perhaps your example) the problem would not show up. I'll look at #310.
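
To put rough numbers on that ramp (just plugging values into the decay equation quoted above):

# Decay weight d(x) = 0.9999 * (1 - exp(-x / 2000)) at a few update counts.
import math

decay = lambda x: 0.9999 * (1 - math.exp(-x / 2000))
for x in (1000, 5000, 10000, 20000):
    print(f'{x:>6} updates -> decay {decay(x):.4f}')
# ~0.3934, 0.9178, 0.9932, 0.9999: the ramp only approaches its 0.9999 ceiling
# after roughly 10-20k updates, i.e. several COCO epochs.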

@glenn-jocher glenn-jocher changed the title resume problem --resume issues Jul 9, 2020
@glenn-jocher glenn-jocher added the TODO High priority items label Jul 9, 2020
@glenn-jocher glenn-jocher self-assigned this Jul 9, 2020
@Laughing-q (Member, Author) commented Jul 9, 2020

@glenn-jocher So what you mean is: add a flag to record the iterations and do not save the base model. In that case the EMA's smoothness may be kept, but the previous EMA model will then be used for --resume training. I still think that resuming training from the base model is the right approach, if saving both the EMA and the base model doesn't cost too much. My point is that the base model is the base model and the EMA is the EMA; it's not reasonable to train the EMA after --resume when we were training the base model before --resume. Perhaps I'm overthinking it; some experiments are still needed.
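
For concreteness, a checkpoint that keeps both could look something like the sketch below (the function and key names are assumptions for illustration, not what train.py actually saves):

# Hypothetical checkpoint layout saving both the base model and the EMA,
# plus the EMA update count, so --resume could restore all three.
import torch

def save_checkpoint(path, epoch, best_fitness, model, ema, optimizer):
    ckpt = {
        'epoch': epoch,
        'best_fitness': best_fitness,
        'model': model.state_dict(),         # base model used for training
        'ema': ema.ema.state_dict(),         # smoothed EMA used for val/inference
        'ema_updates': ema.updates,          # lets the decay ramp resume correctly
        'optimizer': optimizer.state_dict(),
    }
    torch.save(ckpt, path)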

@glenn-jocher (Member)

@Laughing-q I don't really know. I'll try to update the code today with my idea, since it's a very small change that may fix a lot. If it doesn't fix things, though, then yes, we'll have to update the code further to save everything.

@glenn-jocher (Member)

Here are updated results from the above 2 models. It seems things do eventually return to normal after 20 or so epochs, which is the right amount of time for the EMA decay to reach steady state.

Probably the correct way to think about the EMA is that in the early stages of training it trails the model: it is smoother, less noisy, and delayed in time as it follows the model's changes. In the later stages of training it is no longer trailing; it is more of a mean or centered version of the model. As the model oscillates around a local minimum, the EMA sits roughly at rest near that minimum, more or less, is how I imagine it. To further complicate things, the decay ramp I've added means that the EMA and the model behave very similarly in the very early stages of training, but this doesn't really matter for the final result.

[results plot]

@Laughing-q (Member, Author)

@glenn-jocher I see, thanks for your fix. I'll do some experiments with your fix, and I think it'll work. Thanks a lot!

@glenn-jocher (Member)

@Laughing-q ok great! You probably won't notice any effect for early epochs or small datasets, but if you train yolov5s on COCO for 10 or 20 epochs, for example, this should hopefully result in better resuming if my idea is right.

@glenn-jocher (Member)

@Laughing-q I can confirm that --resume now appears to be working near-seamlessly. I resumed a stopped run (266 here) at COCO epoch 77, and the effects are hardly noticeable. It looks like our changes worked! I'm going to remove the TODO label from this issue, as I think it is now essentially resolved.

[results plot]

@glenn-jocher glenn-jocher removed the TODO High priority items label Jul 12, 2020
@Laughing-q (Member, Author)

@glenn-jocher Some results of my experiments also show that --resume is working near-seamlessly. Thanks for your work!

@zcode86 commented Jul 14, 2020

@glenn-jocher Nice to see it. What made the difference? The EMA?

@glenn-jocher (Member)

@Laughing-q @iamastar88 Strangely enough, I had to eliminate the update. I resumed a larger 5l model and observed strange behavior after the resume. The 'seam' was fine, but the model performed much more poorly than its peers after resuming, and even 30 epochs later it was still underperforming by several mAP points.

The prior --resume functionality shown in #292 (comment) has a very large discontinuity, but it eventually recovers very well, so that after about 20-30 COCO epochs the --resumed model is training identically to its peers. Not sure why.

@Laughing-q (Member, Author)

@glenn-jocher So is there still a huge drop in mAP? If possible, could you save both the base model and the EMA and then --resume with both for the same experiment? Or maybe there are some other issues we haven't found yet.

@glenn-jocher (Member)

@Laughing-q Yes, that's a third option. We'd have to test to see its effect, of course; I just don't have time to do this myself.

@Laughing-q (Member, Author)

@glenn-jocher OK, I know you are very busy. Maybe I will do some experiments on this, but I don't have the hardware to run a huge dataset such as COCO. All I can do is run my own dataset, and my concern is that it is too small to show the effect and the difference. If my results show any effects or differences, I'll report them on this issue.

@glenn-jocher (Member) commented Jul 15, 2020

@Laughing-q Actually, some people see mAP jumps on --resume, so for some it's working better than expected (#409). It seems to be very dataset-specific.

@Laughing-q (Member, Author)


@glenn-jocher I did my experiment: I ran 250 epochs, interrupted, then --resumed and ran to 300 epochs. With your fix, saving the EMA only is almost the same as saving both the EMA and the base model, so I think your fix is right and working! Have you ever encountered this issue without your fix? In the later stages of training the EMA should no longer be trailing, so I think the regression you saw must be caused by something else, not your fix. This issue is strange. Sorry, I couldn't run the COCO dataset on yolov5l with my device, so I couldn't do that experiment. But I think there may be some other issue that we haven't found yet. If you find it, please tell me. Thanks a lot.

@glenn-jocher (Member)

Hmm, ok! Yeah, I thought I had a good fix by setting the EMA updates value; I don't know why it didn't work for me.

@Laughing-q (Member, Author)

@glenn-jocher I saw that #300 is missing from the newest code. Is this fix covered by something else, or was the LR bug solved in another way?

@Laughing-q (Member, Author)

@glenn-jocher The scheduler initialization code, which resets the LR, is now below the optimizer loading code again.

@glenn-jocher (Member)

@Laughing-q Ah, thanks for catching this. I think I may be accepting too many PRs too quickly. Often PRs are not rebased against the current master, and the PR inadvertently reverts earlier changes. It's a huge problem with no easy solution, so I'm being much more careful with larger PRs now.

Ok, about this particular issue, can you submit a new PR please? Sorry for the double effort.

@Laughing-q (Member, Author)

@glenn-jocher I've submitted a new PR; please check it.
