Introducing the Transformer in 10 Minutes
Preface
The Transformer has become one of the most popular deep learning models in recent years, especially in the NLP domain.
To save the time I spend going back to review the model's concepts, I wrote this article as a note. I hope it also helps anyone who is just starting to look at the model but is stuck somewhere.
Why Have Transformer-Based Models Become the Trend?
To talk about the Transformer, we first need to compare it with other Seq2Seq models such as RNNs. A traditional Seq2Seq model must compute the results of the previous steps before it can process the current step. The Transformer can process the whole input sequence in parallel through the Self-Attention mechanism, which reduces latency while encoding the sequence.
With the Self-Attention mechanism, a Transformer-based model is also better able to learn the contextual relationships within a sequence than a traditional Seq2Seq model, which only propagates its result recurrently from left to right.
Research in recent years has also shown that Transformer-based models such as BERT, GPT-3, Sentence BERT… perform strongly on NLP tasks that rely heavily on context comprehension during inference.
Model Architecture
The Transformer is an encoder-decoder architecture, similar to other seq2seq models built on RNNs or LSTMs. The input of the model is a label-encoded sequence, that is, a series of integer token IDs.
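As a small illustration of what a label-encoded input sequence looks like, here is a minimal sketch. The whitespace tokenizer and the on-the-fly vocabulary are assumptions made for the example; real models use a fixed, pre-built vocabulary and usually a subword tokenizer.

```python
# Hypothetical toy example: map each distinct word to an integer ID.
sentence = "the cat sat on the mat"
tokens = sentence.split()

vocab = {}
for token in tokens:
    # Assign the next free ID the first time a token is seen.
    vocab.setdefault(token, len(vocab))

token_ids = [vocab[token] for token in tokens]
print(token_ids)
```

The repeated word "the" maps to the same ID both times, which is the whole point of label encoding: the model receives IDs, and an embedding layer later turns each ID into a vector.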
Self-Attention
One of the biggest highlights of the Transformer is the self-attention mechanism. So what makes it special?
First, each element of the embedded input sequence is projected into three vectors by three different projection matrices. After the projection, these vectors are denoted q, k, and v.
- q: Queries the k of the other embedded vectors by dot product
- k: To be queried by other vectors
- v: The value that is weighted by the score of each (q, k) dot product pair
An intuitive way to understand the self-attention mechanism is that it lets the model learn how each embedded element relates to all the other elements of the sequence, no matter whether the other element appears before or after the current one.
Note: Self-Attention is a special case of Attention. In Self-Attention, q, k, and v are all computed from the same sequence; in general Attention they need not be.
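The q, k, v steps described above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the names `w_q`, `w_k`, `w_v` and the random weights are assumptions for illustration, and the scaling by the square root of the key dimension plus the softmax follow the scaled dot-product attention of the original paper, which this article does not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections."""
    q = x @ w_q                                # queries
    k = x @ w_k                                # keys
    v = x @ w_v                                # values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))        # stand-in for embedded tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of `weights` tells how much element i attends to every element of the same sequence, both earlier and later ones.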
Multi-Headed Attention
After obtaining the q, k, v vectors but before applying the dot products, the Transformer splits each of the three vectors into multiple parts of equal dimension. This gives multiple groups (heads) of q, k, v, and the dot products are performed within each group.
This is similar to how a CNN handles image tasks: each convolution layer contains multiple channels, and each channel can learn a pattern that differs from the other channels and helps the inference.
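The head-splitting step can be sketched as a reshape. This is a minimal sketch under a few assumptions: `split_heads` and `multi_head_attention` are hypothetical helper names, q, k, v are random stand-ins for already-projected vectors, and the final output projection used in the original paper is left out to keep the example short.

```python
import numpy as np

def split_heads(x, num_heads):
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

def multi_head_attention(q, k, v, num_heads):
    qh, kh, vh = (split_heads(t, num_heads) for t in (q, k, v))
    # Dot products are computed independently inside each head.
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(qh.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    heads = weights @ vh                       # (num_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model).
    return heads.transpose(1, 0, 2).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out = multi_head_attention(q, k, v, num_heads=2)
print(out.shape)
```

With `num_heads=2` and a model dimension of 8, each head works on a 4-dimensional slice of q, k, and v.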
Positional Encoding
What Is Positional Encoding?
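The positional encoding has the same shape as the embedded input sequence: one vector per position, with every component in the interval [-1, 1] (in the original paper these are sine and cosine values). Each position gets a distinct vector, which is added to the embedded vector at that position before it goes into the attention mechanism.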
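For reference, here is a minimal sketch of the sinusoidal encoding from the original Transformer paper, which is one common way to produce such values. The function name and the zero-filled `embedded` array are assumptions for illustration; learned positional embeddings are another option that is not shown here.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
embedded = np.zeros((6, 8))          # stand-in for the embedded input sequence
attention_input = embedded + pe      # added before the attention mechanism
print(pe.min(), pe.max())
```

Because the values come from sine and cosine, every component stays within [-1, 1], and each position receives a different vector.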
Why the Transformer Needs Positional Encoding
Unlike other Seq2Seq models such as RNNs, the attention mechanism on its own is insensitive to the order of the embedded input vectors: shuffling the input simply shuffles the output in the same way.
In an RNN or LSTM, the current step is affected by the previous steps, so the model learns the order of the sequence implicitly. The Transformer has no such recurrence, and this is why positional encoding was introduced: it helps the model recognize element positions while performing attention.
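The order-insensitivity of plain attention can be checked numerically. The sketch below is a simplified demo that assumes identity projections (q = k = v = x) and random inputs; it shows that permuting the input only permutes the output rows.

```python
import numpy as np

def attention(x):
    # Self-attention with identity projections, to keep the demo short.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))          # 4 embedded elements, no position info
perm = np.array([2, 0, 3, 1])        # an arbitrary reordering

out = attention(x)
out_shuffled = attention(x[perm])

# The shuffled output is the original output with its rows reordered.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)
```

Each element gets the same output vector wherever it sits in the sequence, so without positional information the model cannot tell one word order from another.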
Encoder
The output of the Transformer Encoder can be seen as the input sequence after feature transformation. If the input sequence is a sentence, the output vectors carry its semantic meaning to some extent.
The Encoder is composed of multiple layers, each pairing a self-attention block with a feed-forward neural network.
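The layer stacking can be sketched roughly as follows. This is a simplified sketch under stated assumptions: single-head attention, random weights created by a hypothetical `init_layer` helper, and residual connections as in the original paper. Layer normalization, biases, and dropout are omitted for brevity.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def feed_forward(x, w_1, w_2):
    # Position-wise two-layer network with a ReLU in between.
    return np.maximum(0.0, x @ w_1) @ w_2

def encoder(x, layers):
    for p in layers:
        x = x + self_attention(x, p["w_q"], p["w_k"], p["w_v"])  # residual
        x = x + feed_forward(x, p["w_1"], p["w_2"])              # residual
    return x

def init_layer(rng, d_model, d_ff):
    # Hypothetical helper: small random weights for the demo only.
    shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
              "w_v": (d_model, d_model), "w_1": (d_model, d_ff),
              "w_2": (d_ff, d_model)}
    return {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
layers = [init_layer(rng, d_model, d_ff) for _ in range(2)]
x = rng.normal(size=(5, d_model))    # stand-in for embedded input + positions
encoded = encoder(x, layers)
print(encoded.shape)
```

The output keeps the shape of the input, one transformed vector per input element, which is what lets the layers be stacked.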
Decoder
The main differences between the encoder and the decoder are the input sequence and the attention mechanism.
Shifted Right in Input Sequence
The input sequence may be the most confusing part when first reading the paper. Why should the decoder's input be shifted right by one position compared with the decoder's output?
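The reason is simple. Think about how a text generator works: each new word is generated after, and conditioned on, the previously generated words. So during training the decoder is fed the target sequence shifted right by one position, so that at each step it sees only the earlier words and learns to predict the next one. In this respect the encoder and decoder are trained just like an encoder-decoder built from RNNs or LSTMs.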
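A tiny example may make the shift concrete. The special tokens `<bos>` and `<eos>` and the helper name `make_decoder_pair` are assumptions for illustration; actual token names depend on the tokenizer in use.

```python
BOS, EOS = "<bos>", "<eos>"   # hypothetical start and end tokens

def make_decoder_pair(target_tokens):
    # The decoder input is the target shifted right by one position.
    decoder_input = [BOS] + target_tokens
    decoder_target = target_tokens + [EOS]
    return decoder_input, decoder_target

decoder_input, decoder_target = make_decoder_pair(["I", "like", "tea"])
for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, seen, "->", expected)
```

At every step the decoder sees the previous word and is trained to output the next one, which matches how words are produced one after another at generation time.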
Attention
The attention mechanism in the decoder takes the output of the encoder as the source of its k and v vectors. The embedded input vectors of the decoder are projected to q vectors, which are then combined with k and v through the dot products.
As I understand it, the decoder uses the semantic features of the sentence as query vectors and computes a score against each encoded vector of the input sequence, showing which elements of the sequence should be focused on.
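The difference from self-attention is only where q, k, and v come from, as the following minimal sketch tries to show. The function name, the weight names, and the random inputs are assumptions for illustration; scaling and softmax follow the scaled dot-product attention of the original paper.

```python
import numpy as np

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q            # queries come from the decoder side
    k = encoder_output @ w_k            # keys come from the encoder output
    v = encoder_output @ w_v            # values come from the encoder output
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (tgt_len, src_len)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(5, d_model))   # 5 source elements
decoder_states = rng.normal(size=(3, d_model))   # 3 target positions
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Each row of `weights` has one entry per source element, so it can be read as how strongly a given target position focuses on each element of the input sequence.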
Masking
In the previous parts we skipped some details of training. How do we deal with input sequences of different lengths? And how do we handle the fact that the model's input size is fixed, while the input sequence should grow step by step during training?
Padding
To answer the first question: shorter sequences are padded to a common length, and the purpose of the padding mask is to mask out this redundant part of the input sequence. In practice, the padding mask is a boolean array in which the masked positions are set to False.
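A padding mask of this kind could be built as in the sketch below. The helper name `pad_batch`, the padding ID 0, and the uniform scores are assumptions for illustration. Setting masked scores to a large negative value before the softmax is one common way to apply the mask.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        ids[i, :len(seq)] = seq
        mask[i, :len(seq)] = True     # True = real token, False = padding
    return ids, mask

ids, mask = pad_batch([[5, 7, 9], [4, 2]])

# Apply the mask to some (here uniform) attention scores.
scores = np.zeros(ids.shape)
masked = np.where(mask, scores, -1e9)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(mask)
```

The padded position in the second sequence ends up with an attention weight of zero, so it no longer influences the result.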
Sequence Mask
To answer the second question, the sequence mask is a lower triangular matrix filled with ones. In practice, the mask is applied element-wise to the attention scores of each training sequence, so that each position can only attend to itself and to earlier positions. This means the recurrent part of training can be performed in parallel and accelerated by matrix operations.
The sequence mask takes advantage of the attention mechanism. In a traditional seq2seq model, each training step must wait until the previous steps are done; with the Transformer, the training pass over each sequence can be finished in a single step.
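The lower triangular mask can be sketched with `np.tril`. The helper names and the uniform scores are assumptions for illustration, and blocking positions by setting their scores to a large negative value before the softmax is one common way to apply the mask.

```python
import numpy as np

def sequence_mask(seq_len):
    # Lower triangular: position i may look at positions 0..i only.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores, mask):
    scores = np.where(mask, scores, -1e9)     # block the future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = sequence_mask(4)
weights = masked_attention_weights(np.zeros((4, 4)), mask)
print(mask.astype(int))
print(weights)
```

All four rows are computed in one matrix operation, yet row i only uses positions up to i, which reproduces the step-by-step behavior of a recurrent model without the waiting.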