Here is the code:
import torch
import torch.nn as nn

class GRN(nn.Module):
    """ GRN (Global Response Normalization) layer """
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)  # per-channel global L2 norm
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)   # divide by the mean response across channels
        return self.gamma * (x * Nx) + self.beta + x       # scale x by Nx (the last line discussed below)
Let's suppose there is a tensor with 4 channels: A, B, C, D. Feature map A is highly activated, while B, C, D are not.
Then we compute Gx, and GxA is likewise larger than the others.
Then we move on to Nx. Nx is now a normalized relative response, but NxA is still larger than Nx for B, C, D.
In the last line, x is multiplied by Nx, so feature map A, which is highly activated, gets multiplied by a relatively large number, while B, C, D get multiplied by small numbers. The result is that A becomes even more activated and B, C, D become even less activated.
In a nutshell, this layer's function is to amplify features that are already relatively strong while suppressing features that are relatively inactive. This may help the model distinguish different features of an object, which may be why it performs better in small models.
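To make this concrete, here is a tiny toy example with made-up numbers, using just the Gx/Nx computation from the code above (gamma and beta left out):

import torch

# 4 channels (A, B, C, D) on a 2x2 spatial grid; channel A is highly activated.
x = torch.ones(1, 2, 2, 4)                           # [batch, H, W, channels]
x[..., 0] *= 10.0                                    # channel A

Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # per-channel response: [20, 2, 2, 2]
Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)     # ~[3.08, 0.31, 0.31, 0.31]

print((x * Nx)[..., 0].mean())   # channel A: ~30.8, amplified
print((x * Nx)[..., 1].mean())   # channel B: ~0.31, suppressed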
However, in the huge-model regime, the model lags slightly behind. This reminds me of the Johnson-Lindenstrauss lemma: a high-dimensional vector can express far more than its raw dimension suggests. The GRN layer essentially amplifies locally dominant features, but when applied to larger vectors that carry richer semantic information, it may compress complex information into simpler information. This might be the reason it does not perform as well in large-scale models.
Here are two ideas (I am not in a position to run the experiments myself, lol):
Change the GRN layer to Instance Norm in the deep layers of the large model (a rough sketch of this is given after the code below).
Or try something like this (input is [batch, h*w, channels]):
class GeluNat(nn.Module):
    def __init__(self, channels: int):
        super(GeluNat, self).__init__()
        self.cd = max(16, channels // 16)      # reduced channel dimension, at least 16
        self.w = nn.Linear(channels, self.cd)  # project channels down to cd
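For the first idea, here is a minimal sketch of what swapping GRN for Instance Norm in the deep blocks could look like (the DeepInstanceNorm name and the channels-last [batch, H, W, channels] layout are just assumptions for illustration, not anything from the repo):

import torch.nn as nn

class DeepInstanceNorm(nn.Module):
    """Hypothetical GRN replacement for the deep stages of a large model:
    per-sample, per-channel Instance Norm instead of global response normalization."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(dim, affine=True)

    def forward(self, x):                  # x: [batch, H, W, channels], like GRN above
        x = x.permute(0, 3, 1, 2)          # InstanceNorm2d expects [batch, channels, H, W]
        x = self.norm(x)
        return x.permute(0, 2, 3, 1)       # back to channels-last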