
Loss is coming nan #52

Open · ridhajuneja123 opened this issue Jan 30, 2019 · 21 comments

@ridhajuneja123

No description provided.

@ridhajuneja123 (Author)

Both loc_loss and cls_loss are coming out as NaN. Can you suggest a solution?
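Not a fix, but a generic first debugging step (a hypothetical helper, not part of this repo): check tensors for nan/inf as early as possible in the training loop, so you can see where the divergence starts.

```python
import torch

def check_finite(name, t):
    """Raise as soon as a tensor picks up nan/inf, naming the culprit."""
    if not torch.isfinite(t).all():
        raise RuntimeError('%s contains nan/inf' % name)

# Inside the training loop (hypothetical variable names):
# check_finite('loc_preds', loc_preds)
# check_finite('cls_preds', cls_preds)
# check_finite('loss', loss)
```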

@ResearchingDexter

The condition may be caused by the anchor sizes: the anchors' sizes may not match your detected objects.

@xhtian95

I also met this problem.

@ResearchingDexter

> Both loc_loss and cls_loss are coming out as NaN. Can you suggest a solution?

I suppose you can print the number of positive examples, and adjust the anchor ratios according to that number.
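A minimal sketch of that suggestion, assuming the label encoding used in this thread (cls_targets > 0 marks positive anchors, -1 marks ignored ones):

```python
import torch

def count_anchor_matches(cls_targets: torch.Tensor) -> None:
    """Print how many anchors matched a ground-truth box in this batch."""
    pos = cls_targets > 0          # anchors assigned to an object class
    ignored = cls_targets == -1    # anchors excluded from the loss
    print('positives: %d | ignored: %d | total: %d'
          % (pos.sum().item(), ignored.sum().item(), cls_targets.numel()))
```

If positives is frequently 0 or near 0, the anchor scales/ratios likely do not cover your objects' shapes.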

@liqima commented Apr 29, 2019

Did you solve this problem?

@anotherother commented May 4, 2019

Who else has this problem? Can someone recommend a solution? In my case it looks like this:

```
loc_loss: 0.086 | cls_loss: 673.821 | Train_loss: 673.90753 | avg_loss: 673.90753
loc_loss: 0.088 | cls_loss: 540.022 | Train_loss: 540.11029 | avg_loss: 607.00891
loc_loss: 0.081 | cls_loss: 589.325 | Train_loss: 589.40613 | avg_loss: 601.14132
loc_loss: 0.081 | cls_loss: 418.840 | Train_loss: 418.92139 | avg_loss: 555.58633
loc_loss: 0.083 | cls_loss: 268.827 | Train_loss: 268.90982 | avg_loss: 498.25103
loc_loss: 0.086 | cls_loss: 211.607 | Train_loss: 211.69376 | avg_loss: 450.49149
loc_loss: 0.106 | cls_loss: 71.394 | Train_loss: 71.49988 | avg_loss: 396.34983
loc_loss: 0.075 | cls_loss: 28.076 | Train_loss: 28.15103 | avg_loss: 350.32498
loc_loss: 0.088 | cls_loss: 19.801 | Train_loss: 19.88938 | avg_loss: 313.60991
loc_loss: 0.086 | cls_loss: 12.623 | Train_loss: 12.70911 | avg_loss: 283.51983
loc_loss: 0.092 | cls_loss: inf | Train_loss: inf | avg_loss: inf
loc_loss: nan | cls_loss: nan | Train_loss: nan | avg_loss: nan
loc_loss: nan | cls_loss: nan | Train_loss: nan | avg_loss: nan
```

@anotherother commented May 4, 2019

The problem was solved. Here is the rewritten code:

```python
from __future__ import print_function

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class FocalLoss(nn.Module):
    def __init__(self, num_classes):
        super(FocalLoss, self).__init__()
        self.num_classes = num_classes

    def _one_hot_embedding(self, labels):
        """Embed labels in one-hot form.

        Args:
            labels(LongTensor): class labels
        Returns:
            encoded labels, sized [N, #classes+1]
        """
        y = torch.eye(self.num_classes + 1)  # [D, D]
        return y[labels]  # [N, D]

    def focal_loss(self, x, y):
        """Focal loss.

        Args:
            x(tensor): size [N, D]
            y(tensor): size [N, ]
        Returns:
            (tensor): focal loss
        """
        alpha = 0.25
        gamma = 2

        t = self._one_hot_embedding(y.data.cpu())  # [N,21]
        t = t[:, 1:]  # exclude background
        t = Variable(t).cuda()  # [N,20]

        logit = F.softmax(x, dim=1)
        logit = logit.clamp(1e-7, 1. - 1e-7)  # keep log() away from 0, which produced inf
        conf_loss_tmp = -1 * t.float() * torch.log(logit)
        conf_loss_tmp = alpha * conf_loss_tmp * (1 - logit) ** gamma
        conf_loss = conf_loss_tmp.sum()
        return conf_loss

    def forward(self, loc_preds, loc_targets, cls_preds, cls_targets):
        """Compute loss between (loc_preds, loc_targets) and (cls_preds, cls_targets).

        Args:
            loc_preds(tensor): predicted locations, sized [batch_size, #anchors, 4].
            loc_targets(tensor): encoded target locations, sized [batch_size, #anchors, 4].
            cls_preds(tensor): predicted class confidences, sized [batch_size, #anchors, #classes].
            cls_targets(tensor): encoded target labels, sized [batch_size, #anchors].
        Returns:
            (tensor) loss = SmoothL1Loss(loc_preds, loc_targets) + FocalLoss(cls_preds, cls_targets).
        """
        pos = cls_targets > 0  # [N,#anchors]
        num_pos = pos.data.long().sum()

        # loc_loss = SmoothL1Loss(pos_loc_preds, pos_loc_targets)
        mask = pos.unsqueeze(2).expand_as(loc_preds)        # [N,#anchors,4]
        masked_loc_preds = loc_preds[mask].view(-1, 4)      # [#pos,4]
        masked_loc_targets = loc_targets[mask].view(-1, 4)  # [#pos,4]
        loc_loss = F.smooth_l1_loss(masked_loc_preds, masked_loc_targets, size_average=False)

        # cls_loss = FocalLoss(cls_preds, cls_targets)
        pos_neg = cls_targets > -1  # exclude ignored anchors
        mask = pos_neg.unsqueeze(2).expand_as(cls_preds)
        masked_cls_preds = cls_preds[mask].view(-1, self.num_classes)
        cls_loss = self.focal_loss(masked_cls_preds, cls_targets[pos_neg])

        # Guard: with no positive anchors, num_pos would be 0 and the divisions
        # below would produce inf/nan.
        num_pos = max(1.0, num_pos.item())

        print('loc_loss: %.3f | cls_loss: %.3f'
              % (loc_loss.item() / num_pos, cls_loss.item() / num_pos), end=' | ')

        loss = loc_loss / num_pos + cls_loss / num_pos
        return loss
```
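For reference, a minimal smoke test of the class above, with made-up shapes (2 images, 100 anchors, 20 classes). It needs a GPU, because focal_loss calls .cuda() internally:

```python
import torch

criterion = FocalLoss(num_classes=20)
loc_preds = torch.randn(2, 100, 4).cuda()
loc_targets = torch.randn(2, 100, 4).cuda()
cls_preds = torch.randn(2, 100, 20).cuda()
# -1 = ignored anchor, 0 = background, 1..20 = object classes
cls_targets = torch.randint(-1, 21, (2, 100)).cuda()

loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
print(loss.item())  # should be a finite number, never inf/nan
```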

@heartInsert

Thanks, it worked. But can you tell me which statement you changed? I tried to find it but couldn't. Thank you @miramind

@Imagery007

Where did you get ckpt.pth and params.pth? Please help me @heartInsert, thank you.

@heartInsert

I don't have a pretrained model; I trained the code myself on the VOC dataset. @Imagery007

@Imagery007

Thanks. I didn't understand before. Actually, net.pth can be trained without ckpt.pth and params.pth. @heartInsert Thanks again.

@heartInsert

@Imagery007 Did you predict on a real picture and draw bboxes on it? I think there is a bug in training, since I can't get correct bboxes.

@Imagery007

@heartInsert Yes, I can't run test.py. I still don't know how to solve it.

RuntimeError: Error(s) in loading state_dict for RetinaNet:
Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.weight", "fpn.bn1.bias", "fpn.bn1.running_mean", "fpn.bn1.running_var", "fpn.layer1.0.conv1.weight", "fpn.layer1.0.bn1.weight", "fpn.layer1.0.bn1.bias", "fpn.layer1.0.bn1.running_mean", "fpn.layer1.0.bn1.running_var", "fpn.layer1.0.conv2.weight", "fpn.layer1.0.bn2.weight", "fpn.layer1.0.bn2.bias", "fpn.layer1.0.bn2.running_mean", "fpn.layer1.0.bn2.running_var", "fpn.layer1.0.conv3.weight", "fpn.layer1.0.bn3.weight", "fpn.layer1.0.bn3.bias", "fpn.layer1.0.bn3.running_mean", "fpn.layer1.0.bn3.running_var", "fpn.layer1.0.downsample.0.weight",............
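Not confirmed for this repo, but this error commonly means the checkpoint's key layout doesn't match the bare model: the file may be a wrapper dict (e.g. {'net': state_dict, ...}) or saved from nn.DataParallel with "module." key prefixes. A hedged sketch of both workarounds, with assumed key names:

```python
import torch

ckpt = torch.load('ckpt.pth', map_location='cpu')

# Case 1 (assumption): the file wraps the weights, e.g. {'net': state_dict, 'epoch': ...}
state_dict = ckpt['net'] if isinstance(ckpt, dict) and 'net' in ckpt else ckpt

# Case 2 (assumption): saved from nn.DataParallel, so every key starts with 'module.'
state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v
              for k, v in state_dict.items()}

net.load_state_dict(state_dict)  # net: the RetinaNet instance built in test.py
```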

@Imagery007

@heartInsert I read the code carefully and successfully ran test.py. I found that test.py is only used to draw the filtered anchors and cannot make predictions well.

@Imagery007

@miramind Hello, I want to know the meaning of t in the 43rd line of loss.py. Hope to get your reply. Thanks.

@wvalcke commented Jun 16, 2019

The effective change is this line, just before the print statement:

num_pos = max(1.0, num_pos.item())

It makes num_pos a floating-point number, which makes sure loc_loss.item() / num_pos is a floating-point result as well.
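It also clamps num_pos away from 0, which is what actually produces the inf/nan (see the next comment). A tiny illustration with assumed values, not output from this repo:

```python
import torch

cls_loss = torch.tensor(12.3)  # focal loss summed over pos+neg anchors: still > 0
num_pos = torch.tensor(0)      # batch where no anchor matched a ground-truth box

print(cls_loss / num_pos)                    # tensor(inf); later averages turn nan
print(cls_loss / max(1.0, num_pos.item()))   # tensor(12.3000) with the guard
```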

@xiaoxifei1223

In my experiments (custom data and VOC), I found the classification loss may become NaN, and the reason is that num_pos may be 0.

@Wskisno1

> @heartInsert Yes, I can't run test.py. I still don't know how to solve it.
>
> RuntimeError: Error(s) in loading state_dict for RetinaNet:
> Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.weight", ............

I have the same problem as you. Can you tell me how to solve it?

@Wskisno1

@Imagery007

@Dickoabc123

> > RuntimeError: Error(s) in loading state_dict for RetinaNet:
> > Missing key(s) in state_dict: "fpn.conv1.weight", "fpn.bn1.weight", ............
>
> I have the same problem as you. Can you tell me how to solve it?

Hi there, any luck with this? I'm having the same trouble and would love to know how to solve it.
