Why is the GRN layer called a Normalization? #82

Open
SnifferCaptain opened this issue Dec 15, 2024 · 0 comments

Here is the code:
import torch
import torch.nn as nn

class GRN(nn.Module):
    """ GRN (Global Response Normalization) layer, expecting channels-last input (N, H, W, C).
    """
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # per-channel L2 norm over H, W
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)     # divide by the mean over channels
        return self.gamma * (x * Nx) + self.beta + x         # scaled response plus residual
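For reference, a quick shape check (my own snippet, assuming the GRN class above and the channels-last layout used in ConvNeXt V2 blocks):

grn = GRN(dim=4)
x = torch.randn(2, 7, 7, 4)      # (N, H, W, C), channels-last
print(grn(x).shape)              # torch.Size([2, 7, 7, 4]) -- same shape as the input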

Let's suppose there is a tensor with 4 channels A, B, C, D, where feature map A is highly activated and B, C, D are not.
Then Gx_A, the L2 norm of channel A, is larger than the others.
Next comes Nx, which is Gx divided by its mean over channels: channels with above-average Gx get Nx > 1 and the rest get Nx < 1, so Nx_A is still larger than Nx_BCD.
In the last line, x is multiplied by Nx, so feature map A, which is highly activated, is multiplied by a relatively large number while B, C, D are multiplied by small numbers. As a result, A becomes even more activated and B, C, D even less.
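A quick numeric sketch of that argument (hypothetical values of my own, computing Gx and Nx directly as in the forward above):

import torch

# four channels A, B, C, D; channel A is strongly activated
x = torch.ones(1, 7, 7, 4)
x[..., 0] *= 10.0                                    # channel A

Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # per-channel L2 norm
Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)     # divide by the channel mean

print(Nx.flatten())   # roughly [3.08, 0.31, 0.31, 0.31]: Nx_A > 1 while Nx_BCD < 1,
                      # so x * Nx boosts A and shrinks B, C, D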

In a nutshell, this layer amplifies the features that are already relatively large while suppressing the relatively inactive ones. This may help the model separate different features of the object, which could explain why it performs well in small models.

However, in the huge-model regime the results lag slightly behind. This reminds me of the Johnson-Lindenstrauss lemma: a high-dimensional vector can express far more than its nominal dimension suggests. GRN essentially amplifies a few dominant features, but when applied to larger vectors, which carry richer semantic information, it might collapse complex information into simpler information. This could be why it does not perform as well in large-scale models.

Here are two ideas (I am not equipped to run the experiments, lol):
Replace the GRN layer with Instance Norm in the deep layers of the large model (a sketch follows the code below).
Or try this module (input is [batch, h*w, channels]):
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluNat(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.cd = max(16, channels // 16)          # reduced gating dimension
        self.w = nn.Linear(channels, self.cd)

    def forward(self, x):                          # x: [batch, h*w, channels]
        # unbiased (n-1) variance of the softmax over the reduced dimension, per token
        v = torch.var(F.softmax(self.w(x), dim=-1), dim=-1, unbiased=True, keepdim=True)
        v = v * self.cd                            # n/(n-1) scaling --> max value 1 for a one-hot softmax
        return F.gelu(x * torch.tanh(v))           # variance-gated GELU
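For the GeluNat sketch above, a quick sanity check is that GeluNat(128)(torch.randn(2, 49, 128)) returns a tensor of the same [2, 49, 128] shape, since the variance gate is computed per token and broadcast over channels.

To make the first idea concrete, here is a minimal sketch of my own (not code from this repo) of a channels-last Instance Norm wrapper with the same (N, H, W, C) interface as GRN, so it could be swapped into the deeper stages of a large model:

import torch
import torch.nn as nn

class InstanceNormLast(nn.Module):
    """Hypothetical drop-in for GRN: InstanceNorm2d applied to a channels-last tensor."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(dim, affine=True)

    def forward(self, x):                    # x: (N, H, W, C)
        x = x.permute(0, 3, 1, 2)            # -> (N, C, H, W) for InstanceNorm2d
        x = self.norm(x)
        return x.permute(0, 2, 3, 1)         # back to (N, H, W, C)

Whether either variant actually helps at scale would of course need the experiments I cannot run.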