
Attention Is All You Need


Traditionally, RNNs have been used to encode word sequences. Their drawback is that they process tokens one at a time, so they cannot be computed in parallel and are inefficient at learning global information. CNNs were introduced to address this, but in NLP they need many stacked layers to relate distant words, which makes the model large. Here, Google proposes a different idea: attention. In general, attention obtains global information by comparing each word with every other word.

Model architecture

The model is a seq2seq model: the encoder is on the left and the decoder is on the right. The architecture figure illustrates this clearly.

In the encoder, N=6, and each layer has 2 sublayers.
The first sublayer is a self-attention mechanism.
The second sublayer is a position-wise fully connected feed-forward network.

Decoder: N=6, 3 sublayers per layer
1 self-attention module
2 attention over the encoder output, which combines the encoder information with the decoder state
3 fully connected feed-forward layer

Finally, a linear layer and a softmax layer produce the output probabilities.

Attention mechanism

The attention unit is summarized in the figure and in equation 1: \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V\). Next, we explain it in detail.

First, \(QK^T\) calculates the similarity between embedding vectors (the query of each word is compared with the keys of all words).
Second, the scores are divided by \(\sqrt{d_k}\) and the softmax function normalizes each row into probabilities. The division by \(\sqrt{d_k}\) keeps the dot products from growing large, which would push the softmax into regions where the gradients vanish.
Third, the probabilities are multiplied by \(V\). The probabilities from the previous step therefore act as a sieve that picks out the important features.
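
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names are made up for this post, and the toy input uses Q = K = V = x, whereas the real model first applies learned linear projections to obtain Q, K and V.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # step 1: similarity of every query with every key
    weights = softmax(scores, axis=-1)  # step 2: each row becomes a probability distribution
    return weights @ v, weights         # step 3: weighted sum of the value vectors


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))             # 5 toy "word" vectors of size 8 (assumed data)
out, weights = scaled_dot_product_attention(x, x, x)  # simplification: Q = K = V = x
print(out.shape, weights.shape)
```

Each row of `weights` shows how much one word attends to every other word, which is how a single layer can see the whole sentence at once.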

Multi-head attention runs several attention units in parallel and concatenates their outputs.
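
A rough NumPy sketch of the multi-head idea follows. In the paper each head has its own learned projections and the concatenated result passes through one more linear layer; here the weight matrices `w_q`, `w_k`, `w_v`, `w_o` are random stand-ins, and all names and sizes are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, n_head):
    """x: (n, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // n_head

    def split_heads(t):
        # (n, d_model) -> (n_head, n, d_k): each head gets its own slice of features
        return t.reshape(n, n_head, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)       # (n_head, n, n)
    heads = softmax(scores, axis=-1) @ v                   # (n_head, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate all heads
    return concat @ w_o                                    # final linear projection


rng = np.random.default_rng(0)
d_model, n_head, n = 16, 4, 5                              # toy sizes, not the paper's
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, w_q, w_k, w_v, w_o, n_head)
print(y.shape)
```

Splitting `d_model` across heads keeps the total cost close to that of one full-width attention while letting each head look at a different subspace.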

Position embedding

Because attention by itself cannot capture the position of each word, position embedding is introduced to provide sequence order information. In this model, position embedding is the only source of position information.

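The encoding is \(PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})\) and \(PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})\), where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, \(PE_{pos+k}\) can be represented as a linear function of \(PE_{pos}\) (this follows from the angle-addition identities for sine and cosine).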
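
The linear relation can be worked out for one sine/cosine pair. The shorthand \(\omega_i\) is introduced here only for readability and is not notation from the paper. The matrix in the last line depends on the offset k but not on pos, which is what "linear function" means here.

```latex
\omega_i = \frac{1}{10000^{2i/d_{model}}}

\begin{aligned}
\sin(\omega_i (pos + k)) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos(\omega_i (pos + k)) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)
\end{aligned}

\begin{pmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{pmatrix}
```
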

import numpy as np
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_hid, n_position=200):
        super(PositionalEncoding, self).__init__()

        # Not a parameter
        self.register_buffer('pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))

    def _get_sinusoid_encoding_table(self, n_position, d_hid):
        ''' Sinusoid position encoding table '''
        # TODO: make it with torch instead of numpy

        def get_position_angle_vec(position):
            return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]

        sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
        sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
        sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1

        return torch.FloatTensor(sinusoid_table).unsqueeze(0)

    def forward(self, x):
        return x + self.pos_table[:, :x.size(1)].clone().detach()
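
As a quick sanity check, the same table can be built with NumPy broadcasting and used to verify the relative-position property numerically. This is a sketch only: the function name `sinusoid_table` and the toy sizes are assumptions for this example, and it checks only the first sine/cosine pair, where the frequency is 1.

```python
import numpy as np


def sinusoid_table(n_position, d_hid):
    """Return an (n_position, d_hid) table of sinusoidal position encodings."""
    pos = np.arange(n_position)[:, None]         # (n_position, 1)
    j = np.arange(d_hid)[None, :]                # (1, d_hid)
    angle = pos / np.power(10000, 2 * (j // 2) / d_hid)
    table = np.empty((n_position, d_hid))
    table[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions (2i)
    table[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions (2i+1)
    return table


table = sinusoid_table(n_position=50, d_hid=8)

# For the first pair (i = 0) the angle is simply pos, so shifting by k
# should equal multiplying by a fixed 2x2 matrix that depends only on k.
k = 3
rot = np.array([[np.cos(k), np.sin(k)],
                [-np.sin(k), np.cos(k)]])
shifted = table[:-k, :2] @ rot.T                 # map PE(pos) to PE(pos + k)
print(np.allclose(shifted, table[k:, :2]))       # expected: True
```

The same check works for the other pairs if `k` inside the matrix is replaced by the pair's frequency times `k`.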


Not all NLP tasks require global information; some depend more on local semantic information. So the attention model is not well suited to those tasks.


  1. https://www.cnblogs.com/gczr/p/10114099.html
  2. http://jalammar.github.io/illustrated-transformer/
  3. https://www.jiqizhixin.com/articles/2018-01-10-20
  4. https://arxiv.org/pdf/1706.03762.pdf
  5. https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/fec78a687210851f055f792d45300d27cc60ae41/transformer/Models.py#L23


