
Minuscule Nitpicks. (1) Bad default gain for model initialization (2) misapplication of nn.ModuleList #38

mudballs opened this issue Apr 1, 2023 · 3 comments


mudballs commented Apr 1, 2023

Minuscule nitpicks.

  1. Bad initialization gain value for ReLU nets.

    The initial weights for this feedforward block

    self._mask_layer.apply(_init_weights)
    are set using
    torch.nn.init.xavier_uniform_(module.weight)
    This initializes the block with Xavier-uniform weights at the default gain of 1.0, but the recommended gain for ReLU nets is sqrt(2). I see the same issue in other files as well:
    self.layers.apply(_init_weights)
    In practice this rarely matters and should not noticeably affect the trained weights or the deployed model. (A minimal sketch of the suggested change follows this list.)

  2. Currently using nn.ModuleList when the application calls for nn.Sequential

    It seems that the masknet is composed of primitive blocks that are called either in parallel or in sequence. In the parallel mode, using nn.ModuleList to hold the blocks makes sense. However, in the sequential mode it might be more efficient to organize the blocks with nn.Sequential instead; currently both modes use nn.ModuleList. This doesn't affect correctness, but switching should slightly reduce the runtime during sequential application. Right now the model runs a Python for loop over the blocks in sequence here

    for mask_layer in self._mask_blocks:
    Replacing that loop with an nn.Sequential block would do the inference natively in C, which should be faster and more scalable. (A sketch comparing the two follows at the end of this comment.)
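
For concreteness, here is a minimal sketch of the change suggested in (1). The `_init_weights` helpers and layer sizes below are illustrative assumptions, not the repo's actual code; the point is only the `gain` argument passed to `xavier_uniform_`.

```python
import torch.nn as nn


def _init_weights_current(module):
    # What the issue describes: Xavier uniform with the default gain of 1.0.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)


def _init_weights_relu(module):
    # Suggested change: pass the ReLU gain, sqrt(2), so the fan-in/fan-out
    # scaling accounts for ReLU zeroing roughly half of the activations.
    if isinstance(module, nn.Linear):
        gain = nn.init.calculate_gain("relu")  # == sqrt(2)
        nn.init.xavier_uniform_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)


# Illustrative feedforward block; the sizes are made up.
layers = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())
layers.apply(_init_weights_relu)  # mirrors self.layers.apply(_init_weights)
```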

Thanks for making the repo public btw!
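
And a minimal sketch of the container swap suggested in (2). `MaskBlock` here is a stand-in module, not the repo's implementation; the comparison is only between holding the blocks in an nn.ModuleList with an explicit Python loop versus letting nn.Sequential chain them.

```python
import torch
import torch.nn as nn


class MaskBlock(nn.Module):
    # Stand-in block (a single Linear + ReLU), not the repo's MaskBlock.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))


dim, n_blocks = 64, 3
x = torch.randn(8, dim)

# Current pattern: nn.ModuleList plus an explicit Python loop.
mask_blocks = nn.ModuleList(MaskBlock(dim) for _ in range(n_blocks))
out = x
for mask_layer in mask_blocks:
    out = mask_layer(out)

# Suggested pattern for the sequential mode: nn.Sequential chains the blocks itself.
mask_blocks_seq = nn.Sequential(*(MaskBlock(dim) for _ in range(n_blocks)))
out_seq = mask_blocks_seq(x)
```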


tkatchen commented Apr 1, 2023

Hey mudballs. I really liked your post, but thought I'd add a few notes that might justify their choices a bit more and add a little depth.

  1. You are spot on: with ReLU nets it is important not to use plain Xavier initialization, since that can cause gradients to explode or vanish; in the worst case the wrong initialization can bring training to a halt. So why doesn't it matter much here? Normalization! As we can see from the mlp.py file you pointed out, given certain config options (which I am going to boldly assume they have set to True), normalization is applied inside the model. This normalization helps each layer maintain a roughly constant activation variance, which makes the gradients more resilient to exploding or diminishing. (A small sketch follows this comment.)

  2. I believe there is a much broader question to consider here: when would this model be run sequentially? Certainly not in production, but presumably during testing. So the overarching question is: how much refactoring would it take to maintain separate models for sequential and parallel use? Probably not worth the headache when they can just leave it as is and accept the marginal cost during testing.

edit: thanks for making me refresh my ML knowledge and look at some fun code
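
To illustrate the normalization point in (1), here is a small sketch. The block below (a Linear followed by LayerNorm and ReLU) is an assumption about the general pattern, not the repo's exact architecture; it just shows that the normalization layer rescales activations to roughly unit variance regardless of the initialization gain.

```python
import torch
import torch.nn as nn

# Hypothetical layer pattern: Linear -> LayerNorm -> ReLU. LayerNorm rescales
# each sample's activations to roughly zero mean and unit variance, so the
# scale chosen at initialization matters much less for the forward signal.
block = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.ReLU())

x = torch.randn(8, 64)
with torch.no_grad():
    pre_norm = block[0](x)          # straight out of the linear layer
    post_norm = block[1](pre_norm)  # rescaled by LayerNorm
print(pre_norm.std().item(), post_norm.std().item())  # the second is ~1 regardless of init
```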


mudballs commented Apr 1, 2023

Thanks for the explanation! Regarding (1), what you're saying makes total sense. Regarding (2), my fault for not explaining it well. Running in parallel here means applying each block to the same input and then concatenating the output tensors into one big tensor (which is subsequently processed by an MLP): parallel_app :: Tensor(batchsize, dim) -> Tensor(batchsize, n_blocks*dim). Sequential application feeds the output of the previous layer as input to the current layer: seq_app :: Tensor(batchsize, dim) -> Tensor(batchsize, dim). They are inherently different operations. Honestly, this is a non-issue because it's hard to believe they are running a raw PyTorch model on a production server, so I'm pretty sure someone has already optimized this.
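
A minimal sketch of the two application modes described above, with simple Linear layers standing in for the repo's blocks:

```python
import torch
import torch.nn as nn

batch_size, dim, n_blocks = 8, 64, 3
# Simple Linear layers standing in for the blocks.
blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
x = torch.randn(batch_size, dim)

# Parallel: every block sees the same input; outputs are concatenated.
# Tensor(batchsize, dim) -> Tensor(batchsize, n_blocks * dim)
parallel_out = torch.cat([block(x) for block in blocks], dim=-1)

# Sequential: each block consumes the previous block's output.
# Tensor(batchsize, dim) -> Tensor(batchsize, dim)
seq_out = x
for block in blocks:
    seq_out = block(seq_out)

print(parallel_out.shape, seq_out.shape)  # torch.Size([8, 192]) torch.Size([8, 64])
```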


tkatchen commented Apr 1, 2023

Ah, I was thinking about (2) in terms of sequential vs. parallel computing. You certainly got it right! I see now in the constructor the MaskBlock(mask_block_config, in_features, in_features) vs MaskBlock(mask_block_config, input_size, in_features), where the latter could be wrapped in a Sequential to get the C-like speeds. I'm very curious what the use case for running it in parallel is and how it works?
