Transformer

Since the attention mechanism was proposed, seq2seq models with attention have improved on task after task, so "seq2seq model" now usually refers to a model that combines an RNN with attention (see previous posts for the specific models). Google then proposed the Transformer, which solves the seq2seq problem by replacing the LSTM with a pure attention structure and achieves better results on the translation task. This post focuses on Attention is All You Need.

Transformer Model

Like most seq2seq models, the Transformer consists of an Encoder and a Decoder.

Encoder

The Encoder consists of N = 6 identical layers; a layer is the unit on the left side of the model figure, and the "Nx" there indicates that the unit is repeated N = 6 times. Each layer has two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Each sub-layer adds a residual connection and layer normalization, so the output of a sub-layer can be expressed as:

$$\mathrm{sub\_layer\_output} = \mathrm{LayerNorm}(x + \mathrm{SubLayer}(x))$$
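As a minimal numpy sketch of this wrapping (the function names are assumptions, and the learned scale and bias of layer normalization are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_output(x, sublayer_fn):
    """Residual connection followed by layer normalization: LayerNorm(x + SubLayer(x))."""
    return layer_norm(x + sublayer_fn(x))
```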

Key, Value and Query

The major component of the Transformer is the multi-head self-attention unit. The Transformer views the encoded representation of the input as a set of key-value pairs (K, V), both of length n (the input sequence length); on the decoder side, the previous output is compressed into a query Q (of length m), and the next output is produced by mapping this query against the set of keys and values.
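To make the dimensions concrete, a small illustration (the variable names are assumptions; d_model = 512 is the model dimension of the paper's base model):

```python
import numpy as np

d_model = 512      # feature dimension of each position (512 in the base model)
n, m = 10, 7       # n: input (key/value) length, m: query length

K = np.random.randn(n, d_model)   # keys:    one row per input position
V = np.random.randn(n, d_model)   # values:  one row per input position
Q = np.random.randn(m, d_model)   # queries: one row per output position
```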

Multi-head Self-Attention

Attention can be expressed as:

$$\mathrm{attention\_output} = \mathrm{Attention}(Q, K, V)$$

Multi-head attention projects Q, K, and V through h different linear transformations and then concatenates the h attention results:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

In self-attention, Q, K, and V are all set to the same input.

In the paper, scaled dot-product attention is used to compute the attention weights:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
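A minimal numpy sketch of both the scaled dot-product attention above and the multi-head projection-and-concatenation from the previous formulas; the random projection matrices stand in for the learned parameters W_i^Q, W_i^K, W_i^V, W^O, and the function names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (m, n): similarity of each query to each key
    return softmax(scores, axis=-1) @ V    # weighted sum of values, shape (m, d_v)

def multi_head_attention(Q, K, V, h=8, rng=np.random.default_rng(0)):
    """Project Q, K, V h times, attend in each head, then concatenate and project back."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections (random here; learned W_i^Q, W_i^K, W_i^V in the real model).
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((h * d_k, d_model))   # stands in for the learned W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Self-attention: Q, K, V are all the same sequence of encoded vectors.
x = np.random.randn(10, 512)            # 10 positions, d_model = 512
out = multi_head_attention(x, x, x)     # shape (10, 512)
```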

Position-wise Feed-forward Network
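As defined in the paper, this sub-layer is a fully connected network applied to each position separately and identically: two linear transformations with a ReLU activation in between (d_model = 512 and inner dimension d_ff = 2048 in the base model):

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)W_2 + b_2$$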

Decoder

The structure of the Decoder is similar to that of the Encoder, but each layer has an additional attention sub-layer over the Encoder output. Let us first define the input, output, and decoding process of the Decoder:

- Input: the output of the Encoder and the Decoder output for position i − 1. The attention in the middle of each Decoder layer is therefore not self-attention: its K and V come from the Encoder, while Q comes from the Decoder output at the previous step.
- Output: the probability distribution of the output word at position i.
- Decoding: unlike encoding, which can be computed in parallel for the whole sequence, decoding produces the output one position at a time, like an RNN, because the previous position's output is needed as the attention query.
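A hedged sketch of the Encoder-Decoder attention described above, reusing the scaled_dot_product_attention function from the earlier sketch (shapes and variable names are assumptions):

```python
import numpy as np

d_model, src_len, tgt_len = 512, 10, 4
encoder_output = np.random.randn(src_len, d_model)   # memory produced by the encoder stack
decoder_states = np.random.randn(tgt_len, d_model)   # decoder positions generated so far

# K and V come from the encoder, Q comes from the decoder (not self-attention).
context = scaled_dot_product_attention(
    Q=decoder_states,
    K=encoder_output,
    V=encoder_output,
)
# context has shape (tgt_len, d_model): each target position attends over the source.
```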

Positional Encoding

In addition to the Encoder and Decoder, there is also a data-preprocessing part. The Transformer discards the RNN, yet the biggest advantage of an RNN is its abstraction over sequential (time-series) data. Hence, the authors propose positional encoding, with two candidate methods, and sum the position encodings with the embeddings so that the input carries relative position information.

Both methods give nearly the same results, so we list the first (sinusoidal) method as follows:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

where pos is the position and i is the dimension index. Since sin(α + β) and cos(α + β) are linear functions of sin α, cos α, sin β, and cos β, the encoding of position pos + k can be expressed as a linear function of the encoding of position pos, which lets the model attend to relative positions.
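A minimal numpy sketch of this encoding (assuming an even d_model; the function name is an assumption):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin on even indices, cos on odd indices."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is simply added to the token embeddings:
# x = embedding + positional_encoding(seq_len, d_model)
```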

Full Architecture

We can summarize the full architecture as follows:

Full Architecture
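Stitching the earlier sketches together, one encoder layer can be expressed roughly as follows (random weights stand in for learned parameters; biases, dropout, embeddings, and the output softmax are omitted):

```python
import numpy as np

def feed_forward(x, d_ff=2048, rng=np.random.default_rng(1)):
    """Position-wise FFN: two linear maps with a ReLU in between (biases omitted)."""
    d_model = x.shape[-1]
    W1, W2 = rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model))
    return np.maximum(0, x @ W1) @ W2

def encoder_layer(x):
    """One encoder layer: self-attention then FFN, each with residual + LayerNorm."""
    x = sublayer_output(x, lambda h: multi_head_attention(h, h, h))  # Q = K = V = x
    return sublayer_output(x, feed_forward)

# Stacking N = 6 such layers gives the encoder; the decoder stacks 6 layers that
# additionally attend over the encoder output, as described above.
x = np.random.randn(10, 512)          # a toy input: 10 positions, d_model = 512
for _ in range(6):
    x = encoder_layer(x)
```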