RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other' #62

Closed
sunshine-zkf opened this issue Jun 16, 2019 · 16 comments

Comments

@sunshine-zkf

When I run train.py, I get the following error:

/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py:95: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_peg), end=' | ')
Traceback (most recent call last):
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 116, in <module>
train(epoch)
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 77, in train
loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py", line 95, in forward
print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.data[0]/num_pos, cls_loss.data[0]/num_peg), end=' | ')
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

Why does this happen? Can you help me? Thank you very much!

@sunshine-zkf
Author

@kuangliu

@wvalcke

wvalcke commented Jun 16, 2019

In utils.py, replace the following
a = torch.arange(0,x)
b = torch.arange(0,y)

with

a = torch.arange(0,x,dtype=torch.float)
b = torch.arange(0,y,dtype=torch.float)

You will probably also need to replace every call like .data[0] with .item()
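
A minimal sketch of what this change does, assuming a recent PyTorch build and CPU tensors (the variable names are illustrative, not taken from utils.py). With integer arguments and no explicit dtype, torch.arange returns an int64 (Long) tensor, which is the likely source of the Float/Long mismatch in the error above:

```python
import torch

# Integer arguments without a dtype give an int64 (Long) tensor.
a_long = torch.arange(0, 4)
# An explicit dtype gives a float32 tensor that mixes cleanly with float data.
a_float = torch.arange(0, 4, dtype=torch.float)

print(a_long.dtype)   # torch.int64
print(a_float.dtype)  # torch.float32

# .item() converts a 0-dim tensor into a plain Python number,
# replacing the old .data[0] indexing pattern.
loss = torch.tensor(0.5)
value = loss.item()
print(type(value), value)  # <class 'float'> 0.5
```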

@sunshine-zkf
Author

I modified it as you suggested, but the following error occurred.
I also tried changing the calls to loc_loss.data[0].item() and so on, and got the same error:
Traceback (most recent call last):
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 116, in <module>
train(epoch)
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/train.py", line 77, in train
loss = criterion(loc_preds, loc_targets, cls_preds, cls_targets)
File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/sunshine_zkf/RetinaNet/pytorch-retinanet-master/loss.py", line 95, in forward
print('loc_loss: %.3f | cls_loss: %.3f' % (loc_loss.item()/num_pos, cls_loss.item()/num_peg), end=' | ')
File "/home/sunshine_zkf/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 320, in __rdiv__
return self.reciprocal() * other
RuntimeError: reciprocal is not implemented for type torch.cuda.LongTensor

What is the problem? Can you help me? Thank you very much!

@wvalcke

wvalcke commented Jun 17, 2019

You did
loc_loss.data[0].item()

But it should be
loc_loss.item()

Check other references like this and change them all

@sunshine-zkf
Author

Thank you very much! train.py now runs successfully!
The main cause seems to be a PyTorch version mismatch.
Besides changing to loc_loss.item(), it is also necessary to change the following:
num_pos = pos.data.long().sum().item()
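
A minimal sketch of this pattern, assuming CPU tensors and made-up values (the mask `pos` and the loss value here are illustrative, not from loss.py). Converting both operands to Python numbers with .item() avoids dividing a Python float by a LongTensor, which is the operation the `reciprocal is not implemented` traceback points at:

```python
import torch

# Hypothetical mask of positive anchors.
pos = torch.tensor([0, 1, 1, 0, 1]) > 0

# Count the positives and convert the 0-dim tensor to a Python int.
num_pos = pos.long().sum().item()

# Hypothetical 0-dim loss tensor.
loc_loss = torch.tensor(1.5)

# Both operands are plain Python numbers, so this is ordinary float division.
avg = loc_loss.item() / num_pos
print(num_pos, avg)  # 3 0.5
```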

@sunshine-zkf
Author

Sorry to disturb you. I was wondering whether loss values like the following are correct at the start of training:

loc_loss: 0.085 | cls_loss: 0.001 | train_loss: 0.087 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.087 | cls_loss: 0.001 | train_loss: 0.088 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.081 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.081 | cls_loss: 0.001 | train_loss: 0.082 | avg_loss: 0.088
loc_loss: 0.090 | cls_loss: 0.001 | train_loss: 0.091 | avg_loss: 0.088
loc_loss: 0.084 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.082 | cls_loss: 0.001 | train_loss: 0.083 | avg_loss: 0.088
loc_loss: 0.085 | cls_loss: 0.001 | train_loss: 0.087 | avg_loss: 0.088
loc_loss: 0.083 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.090 | cls_loss: 0.001 | train_loss: 0.091 | avg_loss: 0.088
loc_loss: 0.084 | cls_loss: 0.001 | train_loss: 0.085 | avg_loss: 0.088
loc_loss: 0.080 | cls_loss: 0.001 | train_loss: 0.081 | avg_loss: 0.088

@wvalcke

wvalcke commented Jun 18, 2019

Difficult to say without knowing what you want to train.
If possible, send me your train/test index files.
What are your training images ?
Are you training on your own set, or an existing one ?
If you are starting from an already trained model, it can be normal that the loss is very low at the beginning.

@sunshine-zkf
Author

I am training on the VOC2012 dataset, which matches the files ./data/voc12_train.txt and voc12_val.txt in this repo.
I use the net.pth downloaded online. So, am I starting from an already trained model?

Then I modified loss.py following your #56. The problem got a little better, but it didn't make much difference.

@wvalcke

wvalcke commented Jun 18, 2019

Have you used the script get_state_dict.py?
It initializes net.pth with ResNet-50 pretrained weights (I guess from ImageNet), and the RetinaNet-specific layers are initialized from a Gaussian distribution.
The net.pth it creates has not been trained for detection at all.
That is what I did, and training (on the specific set I used) started with a loss of 2.1, which then decreased during training.
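
For readers unfamiliar with this kind of initialization, here is a rough sketch of Gaussian-initializing newly added layers. It is not the code from get_state_dict.py: the `head` module, its layer sizes, and std=0.01 are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for detector-specific layers (not the repo's module).
head = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)

# Draw conv weights from N(0, std^2) and zero the biases.
# std=0.01 is an assumed value, not confirmed by the thread.
for m in head.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 0.0)

print(head[0].weight.std().item())
```

A checkpoint built this way has sensible backbone features but untrained detection layers, which is consistent with the loss starting well above zero.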

@sunshine-zkf
Author

Yes, I used the script get_state_dict.py to generate net.pth. Do you train on VOC? How do I know that the training is correct?

@wvalcke

wvalcke commented Jun 19, 2019

I started training on the Pascal VOC set; the loss starts at 1.4.
But the first test evaluation fails to load the test images. I can't find them. Where did you download them from?

@wvalcke

wvalcke commented Jun 19, 2019

I took the loss implementation from Issue #52 and started training on VOC.
The loss started at 0.7, and training seems more stable than with the original code, which sometimes went to 'nan'.

@sunshine-zkf
Author

I downloaded the images from the VOC2007 test set, but when I ran test.py there were many boxes on the detected image. I think there's a problem with that code. What about you?
I use the loss from issue #52; the loss is very low but stable.
Can I add you on WeChat?

@wvalcke

wvalcke commented Jun 26, 2019

I trained on the VOC dataset and saw that training ran with the loss from #52, but the results were not OK (hundreds of boxes detected).
I changed the loss function to the definition below, retrained from scratch, and after training tested on one of the images. Now the objects were correctly detected.

def focal_loss_alt(self, x, y):
    '''Focal loss alternative.

    Args:
      x: (tensor) sized [N,D].
      y: (tensor) sized [N,].

    Return:
      (tensor) focal loss.
    '''
    alpha = 0.25

    # One-hot encode the labels, then drop column 0 (the background class).
    t = one_hot_embedding(y.data.cpu(), 1+self.num_classes)
    t = t[:,1:]
    t = Variable(t).cuda()

    xt = x*(2*t-1)  # xt = x if t > 0 else -x
    pt = (2*xt+1).sigmoid()
    pt = pt.clamp(1e-7, 1.0)  # keep pt.log() finite

    w = alpha*t + (1-alpha)*(1-t)  # alpha for positives, 1-alpha for negatives
    loss = -w*pt.log() / 2
    return loss.sum()
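
To make xt and pt concrete, here is a standalone sketch of the same computation on CPU. It assumes torch.nn.functional.one_hot as a stand-in for the repo's one_hot_embedding helper, and the explicit num_classes and alpha parameters are illustrative. With t in {0, 1}, the factor 2*t-1 is +1 or -1, so xt is the logit with its sign flipped for negatives, and pt = sigmoid(2*xt+1) approaches 1 whenever the prediction agrees with the target. This form resembles the FL* variant in the appendix of the focal loss paper with gamma = 2 and beta = 1, though the thread does not confirm that.

```python
import torch
import torch.nn.functional as F

def focal_loss_alt(x, y, num_classes, alpha=0.25):
    """x: [N, num_classes] logits; y: [N] int64 labels, 0 = background."""
    # One-hot over num_classes + 1 labels, then drop the background column.
    t = F.one_hot(y, num_classes + 1)[:, 1:].to(x.dtype)
    xt = x * (2 * t - 1)                 # x for positives, -x for negatives
    pt = (2 * xt + 1).sigmoid().clamp(1e-7, 1.0)
    w = alpha * t + (1 - alpha) * (1 - t)
    return (-w * pt.log() / 2).sum()

# Two anchors, two foreground classes; every logit agrees with its target.
x = torch.tensor([[1.0, -1.0], [-1.0, -1.0]])
y = torch.tensor([1, 0])  # first anchor is class 1, second is background
loss = focal_loss_alt(x, y, num_classes=2)
print(loss.item())
```

Because every entry of xt is 1 in this toy input, each term reduces to -w*log(sigmoid(3))/2, which makes the result easy to check by hand.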

@sunshine-zkf
Author

Why is it modified like this? I don't quite understand xt. I used the author's other repo, torchcv, but I get 20.3 mAP on the VOC2007 test set. Can I see your modified code?
