
Minuscule Nitpicks. (1) Bad default gain for model initialization (2) misapplication of nn.ModuleList #38

mudballs opened this issue Apr 1, 2023 · 3 comments


mudballs commented Apr 1, 2023

Minuscule nitpicks.

  1. Bad initialization gain value for ReLU nets.

    The initial weights for this feedforward block

    self._mask_layer.apply(_init_weights)
    are set using
    torch.nn.init.xavier_uniform_(module.weight)
    This initializes the block with Xavier-uniform weights at the default gain of 1.0, but the recommended gain for ReLU nets is sqrt(2). I see the same issue in other files as well:
    self.layers.apply(_init_weights)
    In practice this rarely matters and should not noticeably affect the trained weights or the deployed model. (A minimal sketch of the suggested change follows this list.)

  2. Currently using nn.ModuleList when the application calls for nn.Sequential

    It seems that the masknet is composed of primitive blocks that are called either in parallel or in sequence. In the parallel mode, using nn.ModuleList to hold the blocks makes sense. However, in the sequential mode it might be more efficient to organize the blocks with nn.Sequential instead; currently both modes use nn.ModuleList. This doesn't affect correctness, but switching should slightly reduce the runtime during sequential application. Right now the model runs a Python for loop over the blocks in sequence here

    for mask_layer in self._mask_blocks:
    Replacing that loop with an nn.Sequential block would do the inference natively in C, which should be faster and more scalable. (A sketch comparing the two follows at the end of this comment.)
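
For concreteness, here is a minimal sketch of the change suggested in (1). The `_init_weights` helpers and layer sizes below are illustrative assumptions, not the repo's actual code; the point is only the `gain` argument passed to `xavier_uniform_`.

```python
import torch.nn as nn


def _init_weights_current(module):
    # What the issue describes: Xavier uniform with the default gain of 1.0.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)


def _init_weights_relu(module):
    # Suggested change: pass the ReLU gain, sqrt(2), so the fan-in/fan-out
    # scaling accounts for ReLU zeroing roughly half of the activations.
    if isinstance(module, nn.Linear):
        gain = nn.init.calculate_gain("relu")  # == sqrt(2)
        nn.init.xavier_uniform_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)


# Illustrative feedforward block; the sizes are made up.
layers = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())
layers.apply(_init_weights_relu)  # mirrors self.layers.apply(_init_weights)
```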

Thanks for making the repo public btw!
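
And a minimal sketch of the container swap suggested in (2). `MaskBlock` here is a stand-in module, not the repo's implementation; the comparison is only between holding the blocks in an nn.ModuleList with an explicit Python loop versus letting nn.Sequential chain them.

```python
import torch
import torch.nn as nn


class MaskBlock(nn.Module):
    # Stand-in block (a single Linear + ReLU), not the repo's MaskBlock.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))


dim, n_blocks = 64, 3
x = torch.randn(8, dim)

# Current pattern: nn.ModuleList plus an explicit Python loop.
mask_blocks = nn.ModuleList(MaskBlock(dim) for _ in range(n_blocks))
out = x
for mask_layer in mask_blocks:
    out = mask_layer(out)

# Suggested pattern for the sequential mode: nn.Sequential chains the blocks itself.
mask_blocks_seq = nn.Sequential(*(MaskBlock(dim) for _ in range(n_blocks)))
out_seq = mask_blocks_seq(x)
```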


tkatchen commented Apr 1, 2023

Hey mudballs. I really liked your post, but thought I'd add a few notes that might justify their choices a bit more and add a little depth.

  1. You are spot on: with ReLU nets it is important not to use plain Xavier initialization, since that can cause gradients to explode or vanish; in the worst case the wrong initialization can bring training to a halt. So why doesn't it matter much here? Normalization! As we can see from the mlp.py file you pointed out, given certain config options (which I am going to boldly assume they have set to True), normalization is applied inside the model. This normalization helps each layer maintain a roughly constant activation variance, which makes the gradients more resilient to exploding or diminishing. (A small sketch follows this comment.)

  2. I believe there is a much broader question to consider here: when would this model be run sequentially? Certainly not in production, but presumably during testing. So the overarching question is: how much refactoring would it take to maintain separate models for sequential and parallel use? Probably not worth the headache when they can just leave it as is and accept the marginal cost during testing.

edit: thanks for making me refresh my ML knowledge and look at some fun code
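
To illustrate the normalization point in (1), here is a small sketch. The block below (a Linear followed by LayerNorm and ReLU) is an assumption about the general pattern, not the repo's exact architecture; it just shows that the normalization layer rescales activations to roughly unit variance regardless of the initialization gain.

```python
import torch
import torch.nn as nn

# Hypothetical layer pattern: Linear -> LayerNorm -> ReLU. LayerNorm rescales
# each sample's activations to roughly zero mean and unit variance, so the
# scale chosen at initialization matters much less for the forward signal.
block = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.ReLU())

x = torch.randn(8, 64)
with torch.no_grad():
    pre_norm = block[0](x)          # straight out of the linear layer
    post_norm = block[1](pre_norm)  # rescaled by LayerNorm
print(pre_norm.std().item(), post_norm.std().item())  # the second is ~1 regardless of init
```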


mudballs commented Apr 1, 2023

Thanks for the explanation! Regarding (1), what you're saying makes total sense. Regarding (2), my fault for not explaining it well. Running in parallel here means applying each block to the same input and then concatenating the output tensors into one big tensor (which is subsequently processed by an MLP): parallel_app :: Tensor(batchsize, dim) -> Tensor(batchsize, n_blocks*dim). Sequential application feeds the output of the previous layer as input to the current layer: seq_app :: Tensor(batchsize, dim) -> Tensor(batchsize, dim). They are inherently different operations. Honestly, this is a non-issue because it's hard to believe they are running a raw PyTorch model on a production server, so I'm pretty sure someone has already optimized this.
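
A minimal sketch of the two application modes described above, with simple Linear layers standing in for the repo's blocks:

```python
import torch
import torch.nn as nn

batch_size, dim, n_blocks = 8, 64, 3
# Simple Linear layers standing in for the blocks.
blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
x = torch.randn(batch_size, dim)

# Parallel: every block sees the same input; outputs are concatenated.
# Tensor(batchsize, dim) -> Tensor(batchsize, n_blocks * dim)
parallel_out = torch.cat([block(x) for block in blocks], dim=-1)

# Sequential: each block consumes the previous block's output.
# Tensor(batchsize, dim) -> Tensor(batchsize, dim)
seq_out = x
for block in blocks:
    seq_out = block(seq_out)

print(parallel_out.shape, seq_out.shape)  # torch.Size([8, 192]) torch.Size([8, 64])
```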


tkatchen commented Apr 1, 2023

Ah, I was thinking about (2) in terms of sequential vs. parallel computing. You certainly got it right! I see now in the constructor the MaskBlock(mask_block_config, in_features, in_features) vs MaskBlock(mask_block_config, input_size, in_features), where the latter could be wrapped in a Sequential to get the C-like speeds. I'm very curious what the use case for running it in parallel is and how it works?
