Why is the GRN layer called a Normalization? #82

Open
SnifferCaptain opened this issue Dec 15, 2024 · 0 comments

Here is the code:
import torch
import torch.nn as nn

class GRN(nn.Module):
    """ GRN (Global Response Normalization) layer, expecting channels-last input (N, H, W, C).
    """
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # per-channel L2 norm over H, W
        Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)     # divide by the mean over channels
        return self.gamma * (x * Nx) + self.beta + x         # scaled response plus residual
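For reference, a quick shape check (my own snippet, assuming the GRN class above and the channels-last layout used in ConvNeXt V2 blocks):

grn = GRN(dim=4)
x = torch.randn(2, 7, 7, 4)      # (N, H, W, C), channels-last
print(grn(x).shape)              # torch.Size([2, 7, 7, 4]) -- same shape as the input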

Let's suppose there is a tensor with 4 channels A, B, C, D, where feature map A is highly activated and B, C, D are not.
Then Gx_A, the L2 norm of channel A, is larger than the others.
Next comes Nx, which is Gx divided by its mean over channels: channels with above-average Gx get Nx > 1 and the rest get Nx < 1, so Nx_A is still larger than Nx_BCD.
In the last line, x is multiplied by Nx, so feature map A, which is highly activated, is multiplied by a relatively large number while B, C, D are multiplied by small numbers. As a result, A becomes even more activated and B, C, D even less.
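A quick numeric sketch of that argument (hypothetical values of my own, computing Gx and Nx directly as in the forward above):

import torch

# four channels A, B, C, D; channel A is strongly activated
x = torch.ones(1, 7, 7, 4)
x[..., 0] *= 10.0                                    # channel A

Gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # per-channel L2 norm
Nx = Gx / (Gx.mean(dim=-1, keepdim=True) + 1e-6)     # divide by the channel mean

print(Nx.flatten())   # roughly [3.08, 0.31, 0.31, 0.31]: Nx_A > 1 while Nx_BCD < 1,
                      # so x * Nx boosts A and shrinks B, C, D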

In a nutshell, this layer amplifies the features that are already relatively large while suppressing the relatively inactive ones. This may help the model separate different features of the object, which could explain why it performs well in small models.

However, in the huge-model regime the results lag slightly behind. This reminds me of the Johnson-Lindenstrauss lemma: a high-dimensional vector can express far more than its nominal dimension suggests. GRN essentially amplifies a few dominant features, but when applied to larger vectors, which carry richer semantic information, it might collapse complex information into simpler information. This could be why it does not perform as well in large-scale models.

Here are two ideas (I am not equipped to run the experiments, lol):
Replace the GRN layer with Instance Norm in the deep layers of the large model (a sketch follows the code below).
Or try this module (input is [batch, h*w, channels]):
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluNat(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.cd = max(16, channels // 16)          # reduced gating dimension
        self.w = nn.Linear(channels, self.cd)

    def forward(self, x):                          # x: [batch, h*w, channels]
        # unbiased (n-1) variance of the softmax over the reduced dimension, per token
        v = torch.var(F.softmax(self.w(x), dim=-1), dim=-1, unbiased=True, keepdim=True)
        v = v * self.cd                            # n/(n-1) scaling --> max value 1 for a one-hot softmax
        return F.gelu(x * torch.tanh(v))           # variance-gated GELU
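For the GeluNat sketch above, a quick sanity check is that GeluNat(128)(torch.randn(2, 49, 128)) returns a tensor of the same [2, 49, 128] shape, since the variance gate is computed per token and broadcast over channels.

To make the first idea concrete, here is a minimal sketch of my own (not code from this repo) of a channels-last Instance Norm wrapper with the same (N, H, W, C) interface as GRN, so it could be swapped into the deeper stages of a large model:

import torch
import torch.nn as nn

class InstanceNormLast(nn.Module):
    """Hypothetical drop-in for GRN: InstanceNorm2d applied to a channels-last tensor."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(dim, affine=True)

    def forward(self, x):                    # x: (N, H, W, C)
        x = x.permute(0, 3, 1, 2)            # -> (N, C, H, W) for InstanceNorm2d
        x = self.norm(x)
        return x.permute(0, 2, 3, 1)         # back to (N, H, W, C)

Whether either variant actually helps at scale would of course need the experiments I cannot run.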