
After using Focal Loss, the network does not converge #811

@pprp

Description

With the focal loss parameters

fl_gamma=0.5
alpha=0.5/0.25

I get the error below:

WARNING: non-finite loss, ending training  tensor([9.14797,     nan, 0.00000,     nan], device='cuda:0')

After I set the parameters to:

fl_gamma=2
alpha=0.25

the network runs but fails to converge.

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    76/272     4.97G      2.39  2.73e-06         0      2.39        82       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.77it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.09s/it]
                 all       391       409    0.0278     0.139   0.00872    0.0464

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    77/272     4.97G      2.35  2.73e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.02s/it]
                 all       391       409    0.0463     0.169    0.0249    0.0727

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    78/272     4.97G      2.36  2.71e-06         0      2.36        83       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.58it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0199     0.147   0.00453    0.0351

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    79/272     4.97G      2.35  2.72e-06         0      2.35        84       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.74it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.29s/it]
                 all       391       409    0.0146     0.132   0.00409    0.0262

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    80/272     4.97G      2.33  2.71e-06         0      2.33        85       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.66it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
                 all       391       409    0.0613     0.152    0.0397    0.0873

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    81/272     4.97G      2.35  2.74e-06         0      2.35        83       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.68it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.46s/it]
                 all       391       409    0.0137     0.112   0.00248    0.0244

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    82/272     4.97G      2.36  2.72e-06         0      2.36        80       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0159     0.115   0.00383    0.0279

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    83/272     4.97G      2.33  2.78e-06         0      2.33        77       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.31s/it]
                 all       391       409    0.0288     0.174    0.0126    0.0495

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    84/272     4.97G      2.34  2.74e-06         0      2.34        99       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.59it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:05<00:00,  1.40s/it]
                 all       391       409    0.0225     0.147   0.00658     0.039

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    85/272     4.97G      2.34  2.73e-06         0      2.34        86       416: 100%|██████████████████████████████████████| 55/55 [00:15<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:03<00:00,  1.03it/s]
                 all       391       409    0.0492     0.149    0.0127     0.074

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    86/272     4.97G      2.32  2.78e-06         0      2.32        79       416: 100%|██████████████████████████████████████| 55/55 [00:14<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|████████████████████████████████████████| 4/4 [00:04<00:00,  1.04s/it]
                 all       391       409    0.0303     0.139   0.00757    0.0498

What's more, I only have one class, and I use the command below:

python train.py --cfg cfg/yolov3-tiny.cfg --arc Fdefault 

Activity

glenn-jocher (Member) commented on Jan 29, 2020

@pprp there is essentially zero obj loss in your second example (~2.7e-06), so obviously the network will never learn objectness this way.
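
To make that concrete: the 'total' column in the log is just the sum of the per-component losses, so plugging in the epoch 76 numbers shows how little the objectness term contributes (a quick illustrative check, not repo code):

giou_loss, obj_loss, cls_loss = 2.39, 2.73e-06, 0.0   # numbers from epoch 76/272 above
total = giou_loss + obj_loss + cls_loss
print(total)             # ~2.39, matching the 'total' column
print(obj_loss / total)  # ~1e-06: objectness supplies about a millionth of the loss,
                         # so there is essentially no gradient signal to learn it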

glenn-jocher (Member) commented on Jan 29, 2020

@pprp also, if focal loss produces worse results, then clearly don't use it.

pprp (Author) commented on Jan 30, 2020

What should I do if I want to use focal loss?

glenn-jocher (Member) commented on Jan 30, 2020

@pprp try different settings.

pprp (Author) commented on Jan 31, 2020

Thank you very much. I will try to fix this problem.

glenn-jocher (Member) commented on Jan 31, 2020

@pprp by the way, I was looking at the focal loss function. I think the reduction setting may need an update now that the loss reduction functions are set to sum rather than mean, so there may be a bug here that is our fault. I'll try to push an update today.
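
For reference, here is a sketch of how a focal-loss wrapper can preserve the wrapped criterion's reduction setting. It follows the general shape of the FocalLoss class in this repo, but treat it as an illustration of the idea rather than the exact committed fix:

import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    # Wraps an existing criterion, e.g. FocalLoss(nn.BCEWithLogitsLoss(), gamma=2.0, alpha=0.25)
    def __init__(self, loss_fcn, gamma=2.0, alpha=0.25):
        super().__init__()
        self.loss_fcn = loss_fcn
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = loss_fcn.reduction  # remember the caller's reduction ('mean' or 'sum')
        self.loss_fcn.reduction = 'none'     # need per-element losses to apply the modulation

    def forward(self, pred, true):
        loss = self.loss_fcn(pred, true)                       # per-element BCE with logits
        pred_prob = torch.sigmoid(pred)                        # probabilities from logits
        p_t = true * pred_prob + (1 - true) * (1 - pred_prob)
        alpha_factor = true * self.alpha + (1 - true) * (1 - self.alpha)
        loss *= alpha_factor * (1.0 - p_t) ** self.gamma       # focal modulation

        # Re-apply whatever reduction the surrounding loss code expects;
        # this is what breaks if the pipeline switches from 'mean' to 'sum'
        # while the wrapper keeps reducing with 'mean'.
        if self.reduction == 'mean':
            return loss.mean()
        if self.reduction == 'sum':
            return loss.sum()
        return loss

With this shape, a criterion constructed with reduction='sum' keeps summing after being wrapped, which is what the updated loss code assumes.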

glenn-jocher (Member) commented on Jan 31, 2020

@pprp ok, the fix is done in 189c704

Can you git pull and try training again, starting from the default focal loss parameters?

self-assigned this on Jan 31, 2020

pprp (Author) commented on Feb 1, 2020

Thanks for your reply, I will retrain tomorrow and inform you of the final result.

pprp (Author) commented on Feb 2, 2020

@glenn-jocher I tried the fixed version but got the same problem.
I used your default focal loss parameters:

fl_gamma=0.5
alpha=1

If I use Fdefault, the network gets a non-finite loss error.

If I use uFBCE, the network does not converge.


     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     9/272     4.98G      3.01     0.228         0      3.24        76       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.78it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.12s/it]
                 all       391       409   0.00201     0.902   0.00185   0.00401

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    10/272     4.98G      2.99     0.219         0      3.21        71       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:21<00:00,  3.54it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.93s/it]
                 all       391       409   0.00203     0.914    0.0019   0.00406

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    11/272     4.98G       2.9     0.213         0      3.11        80       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.70it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:37<00:00, 24.30s/it]
                 all       391       409   0.00205     0.922   0.00195   0.00409

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
    12/272     4.98G      2.83     0.193         0      3.03        86       416: 100%|███████████████████████████████████████████████████████████████████████████| 76/76 [00:20<00:00,  3.65it/s]
               Class    Images   Targets         P         R   mAP@0.5        F1: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [01:36<00:00, 24.16s/it]
                 all       391       409   0.00203     0.912   0.00191   0.00405

glenn-jocher (Member) commented on Feb 3, 2020

@pprp ah ok. Well, it seems focal loss is not the best choice for your problem. I recommend you stick to the repo defaults (i.e. --arc default). They are the defaults for a reason.

FranciscoReveriano (Contributor) commented on Feb 4, 2020

From experience, @pprp, focal loss is usually not the best way to go. I don't know what you are training on, but I would recommend increasing the img-size, lowering the initial learning rate by a factor of 10, or lowering the training IoU threshold.

pprp (Author) commented on Feb 14, 2020

In my problem, I want to use focal loss to balance the positive and negative samples.

I have a question about lobj.

In the compute_loss function:

BCEobj = nn.BCEWithLogitsLoss(pos_weight=ft([h['obj_pw']]), reduction=red)  # objectness criterion

giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # GIoU between predicted and target boxes

tobj[b, a, gj, gi] = giou.detach().type(tobj.dtype)  # GIoU used as the (soft) objectness target

lobj += BCEobj(pi[..., 4], tobj)  # objectness loss

Can you tell me why the loss is calculated between the objectness output and the GIoU? Does this have an effect on the focal loss?

glenn-jocher (Member) commented on Feb 16, 2020

@pprp this is experimental. I think we will revert to the original formulation below; we are currently testing the effect of the change. Focal loss is independent of this, though.

tobj[b, a, gj, gi] = 1.0
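
One way to see why the target formulation and focal loss are orthogonal: BCEWithLogitsLoss accepts soft targets in [0, 1] just as readily as hard 1.0 targets (the GIoU values would need clamping to [0, 1], since GIoU can be negative), so either tobj assignment can be fed to the plain criterion or to a focal-loss-wrapped one. A tiny standalone check with made-up numbers, not repo code:

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
obj_logits = torch.tensor([2.0, -1.0, 0.5])       # raw objectness outputs for 3 matched anchors

hard_targets = torch.ones(3)                      # original formulation: tobj = 1.0
soft_targets = torch.tensor([0.85, 0.40, 0.60])   # GIoU-style soft targets (clamped to [0, 1])

print(bce(obj_logits, hard_targets))              # both are valid targets for BCEWithLogitsLoss,
print(bce(obj_logits, soft_targets))              # with or without a FocalLoss wrapper around it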

github-actions commented on Mar 18, 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.
