Suppose we have a (discrete) sequence as a single sample:
$$x_1, \ldots, x_T$$
Examples include text, audio signals, and time series.
We need (deep) architectures that process sequential data in order to solve different downstream tasks.
The key difference from an MLP is that a recurrent neural network (RNN) must be able to process sequences of varying length.
In a simple RNN, we introduce a hidden state and process the input using a single layer:
Elman network:
$$h_t = \sigma_h(W_h x_t + U_h h_{t-1} + b_h), \quad y_t = \sigma_y(W_y h_t + b_y).$$
Jordan network:
$$h_t = \sigma_h(W_h x_t + U_h y_{t-1} + b_h), \quad y_t = \sigma_y(W_y h_t + b_y).$$
Suppose we want to solve a classification task.
How do we train the model?
What are possible problems?
If we make $k$ steps, we can represent the resulting computation as a DNN with $k$ layers (with shared weights).
This process is called unrolling.
We can then differentiate through it in a normal way.
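As a minimal sketch (assuming PyTorch and made-up dimensions d_in, d_h, k), unrolling an Elman RNN and differentiating through the unrolled graph looks like this:

import torch

# Hypothetical sizes: input dimension 8, hidden dimension 16, k = 10 steps
d_in, d_h, k = 8, 16, 10
W_h = (0.1 * torch.randn(d_h, d_in)).requires_grad_()
U_h = (0.1 * torch.randn(d_h, d_h)).requires_grad_()
b = torch.zeros(d_h, requires_grad=True)

x = torch.randn(k, d_in)   # a single input sequence x_1, ..., x_k
h = torch.zeros(d_h)       # initial hidden state h_0

# Unrolling: the same parameters are reused at every step, so the unrolled
# graph is a k-layer network with shared weights.
for t in range(k):
    h = torch.tanh(W_h @ x[t] + U_h @ h + b)

loss = h.pow(2).sum()      # dummy loss on the last hidden state
loss.backward()            # backpropagation through time
print(W_h.grad.norm(), U_h.grad.norm())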
The problems are standard: although we only have to learn a few parameters, the gradients can vanish or explode.
One of the breakthrough solutions has been the long short-term memory (LSTM) model.
Since the computational block of the RNN is the same at every step, the Jacobians are the same.
The gradient in backpropagation will contain terms of the form $$W^k v.$$
So, if the eigenvalues of $W$ are less than $1$ in absolute value, we get exponential decay.
If some eigenvalues are greater than $1$ in absolute value, we get exponential growth.
If their absolute values are equal to one (for example, for orthogonal weight matrices), training of the RNN can be much more stable.
(That is why orthogonal parametrization can be important!)
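A quick numerical illustration of the three regimes (a sketch with randomly generated matrices; the scaling factors 0.9 and 1.1 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, k = 32, 100
v = rng.standard_normal(d)

A = rng.standard_normal((d, d))
A /= np.max(np.abs(np.linalg.eigvals(A)))          # rescale to spectral radius 1
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal: all |eigenvalues| = 1

for name, W in [("decay (0.9 * A)", 0.9 * A),
                ("growth (1.1 * A)", 1.1 * A),
                ("orthogonal Q", Q)]:
    u = v.copy()
    for _ in range(k):
        u = W @ u                                   # repeated application, as in W^k v
    print(f"{name:20s} ||W^k v|| = {np.linalg.norm(u):.3e}")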
The LSTM unit has the following multiplicative gating mechanism:
\begin{align} f_t &= \sigma(W_f[x_t, h_{t-1}] + b_f) \\ i_t &= \sigma(W_i[x_t, h_{t-1}] + b_i) \\ \tilde{C}_t &= \tanh(W_C[x_t, h_{t-1}] + b_C) \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\ o_t &= \sigma(W_o[x_t, h_{t-1}] + b_o) \\ h_t &= o_t \odot \tanh(C_t) \end{align}
where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $\tilde{C}_t$ is the candidate cell state, $C_t$ is the cell state, and $h_t$ is the hidden state.
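A minimal NumPy sketch of a single LSTM step following the equations above (the weights are random and the dimensions d_in, d_h are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step; params holds W_f, W_i, W_C, W_o of shape (d_h, d_in + d_h) and biases."""
    z = np.concatenate([x_t, h_prev])                      # [x_t, h_{t-1}]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate cell state
    C = f * C_prev + i * C_tilde                           # new cell state
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    h = o * np.tanh(C)                                     # new hidden state
    return h, C

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
params = {f"W_{g}": 0.1 * rng.standard_normal((d_h, d_in + d_h)) for g in "fiCo"}
params.update({f"b_{g}": np.zeros(d_h) for g in "fiCo"})
h, C = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), params)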
GRU (Gated Recurrent Unit) is a simpler version of LSTM that was introduced in 2014. It shows similar performance.
\begin{align} r_t &= \sigma(W_r[x_t, h_{t-1}] + b_r) \\ z_t &= \sigma(W_z[x_t, h_{t-1}] + b_z) \\ \tilde{h}_t &= \tanh(W_h[x_t, r_t \odot h_{t-1}] + b_h) \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{align}
where $r_t$ is the reset gate, $z_t$ is the update gate, and $\tilde{h}_t$ is the candidate hidden state.
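In practice one rarely writes the cell by hand; here is a sketch using PyTorch's built-in nn.GRU module (the batch size, sequence length, and feature sizes are made up):

import torch
import torch.nn as nn

x = torch.randn(3, 20, 10)     # batch of 3 sequences, length 20, 10 features each

gru = nn.GRU(input_size=10, hidden_size=32, num_layers=1, batch_first=True)
out, h_n = gru(x)              # out: (3, 20, 32), hidden states for every step
print(out.shape, h_n.shape)    # h_n: (1, 3, 32), the final hidden state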
One of the main challenges in sequence modelling is to predict the next symbol (vector) in the sequence.
Extrapolation or prediction is one of the most important and challenging tasks.
This can be viewed as seq2seq problem: given $x_1, \ldots, x_T$ we want a model to output $x_2, \ldots, x_{T+1}$.
A widely used loss in this context is the Connectionist Temporal Classification (CTC) loss.
The CTC loss is quite involved; we need it when mapping a sequence of one length to a sequence of another length. We add an additional 'blank' symbol to the output alphabet, apply a softmax to the RNN/LSTM outputs, and compute the sum over all possible alignments.
There are a lot of alignments, so dynamic programming is used.
$$\mathcal{L}_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi \mid x),$$
where $\mathcal{B}$ is the collapsing map that removes repeated symbols and blanks, and $\mathcal{B}^{-1}(y)$ is the set of all alignments $\pi$ that map to the target sequence $y$.
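The dynamic-programming sum is implemented in standard libraries; a sketch with PyTorch's nn.CTCLoss (the shapes and sizes below are made up):

import torch
import torch.nn as nn

T, N, C = 50, 4, 20                      # input length, batch size, alphabet size (incl. blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # RNN/LSTM outputs after log-softmax
targets = torch.randint(1, C, (N, 10))   # target sequences (index 0 is reserved for the blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)                # sums over all alignments via dynamic programming
loss = ctc(log_probs, targets, input_lengths, target_lengths)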
When processing the sequence, all the information about the previous states is summarized in the hidden state.
This may not be enough, and we would like to be able to look back at previous elements of the sequence when needed.
A natural idea is to consider similar patterns in the sequence, which might help prediction.
Of course, we can just use extended history, but it is typically not enough.
Attention was first introduced in 2015 in the paper by Bahdanau, Cho, and Bengio on Neural Machine Translation.
The attention weights were computed as
$$e_{tj} = a(s_{t-1}, h_j), \quad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^T \exp(e_{tk})},$$
where $s_{t-1}$ is the decoder hidden state and $h_j$ are the encoder hidden states.
A seminal paper was "Attention Is All You Need", which greatly simplified the original architecture of RNN-type models with attention and also introduced the concept of self-attention.
In self-attention, we start from a sequence $X$ parametrized by a matrix of size $N_{seq} \times N_{f}$, where $N_{seq}$ is the length of the sequence, and $N_f$ is the dimension of the vector space representing the sequence.
We need to compute similarities between the vectors in the sequence.
To do so, we map $X$ to query and key values by using linear transformations:
$$Q = X W_Q, \quad K = X W_K.$$
The similarity between the $i$-th and $j$-th elements of the sequence is then given by the scalar product of the corresponding rows:
$$\hat{M}_{ij} = (q_i, k_j).$$
To turn these similarities into attention weights, we apply a softmax along each row:
$$M = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right).$$
The attention matrix has the properties $M_{ij} \geq 0$ and $\sum_j M_{ij} = 1$.
In the self-attention block, the input sequence $X$ is mapped to three matrices $Q, K, V$ (called query, key, and value), and the transformation of $X$ to the output is given by
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V,$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,$$
which is just taking linear combinations of the rows of $V$ with the attention weights.
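A minimal NumPy sketch of a single self-attention head with random projections (made-up sizes), checking that the rows of $M$ sum to one:

import numpy as np

rng = np.random.default_rng(0)
N_seq, N_f, d_k = 6, 16, 8
X = rng.standard_normal((N_seq, N_f))
W_Q, W_K, W_V = (rng.standard_normal((N_f, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)                         # pairwise similarities
M = np.exp(scores - scores.max(axis=1, keepdims=True))
M /= M.sum(axis=1, keepdims=True)                       # row-wise softmax
out = M @ V                                             # linear combinations of the rows of V

print(np.allclose(M.sum(axis=1), 1.0))                  # True: each row sums to 1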
The attention block allows us to exchange information between each component of a sequence.
Then, we can process the information along the feature dimension using MLP.
Originally, people tried to use convolutions, but ended up using an MLP; in some historical code you will still find a Conv1D block.
Also, we use not one attention head but many heads of smaller dimension.
If we have $H$ heads, we compute $H$ attention matrices.
The resulting output has the same size, since each $Q_i, K_i$ has a smaller number of columns.
Thus, we use multi-head attention (why?) and a feedforward MLP to process the features.
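A sketch using PyTorch's built-in multi-head attention (the embedding size 64 and the 8 heads are made up); each head works in a subspace of dimension embed_dim // num_heads, so the concatenated output keeps the original size:

import torch
import torch.nn as nn

x = torch.randn(2, 12, 64)                # (batch, sequence length, N_f)

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
out, attn_weights = mha(x, x, x)          # self-attention: Q = K = V = x
print(out.shape, attn_weights.shape)      # (2, 12, 64), (2, 12, 12)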
For seq2seq (encoder-decoder) models, we can just convert the embeddings to probabilities of the symbols and then maximize the likelihood of the sequence.
However, one of the most important applications of transformer models is unsupervised learning: we do not have any parallel corpus.
Let's describe the GPT (Generative Pre-trained Transformer) model.
In the GPT model, we have data with each sample being a sequence $x_1, \ldots, x_T$. We want to learn the probability distribution
$$p(x_1, \ldots, x_T).$$
We parametrize this distribution in autoregressive form (typically referred to as autoregressive language modelling):
$$p(x_1, \ldots, x_T) = p(x_1)\, p(x_2 \vert x_1)\, p(x_3 \vert x_1, x_2) \cdots,$$
i.e. we want to learn the conditional probabilities of the next symbol given all the previous ones.
This gives the standard objective of the GPT: predict the probability of the next symbol given all the previous ones.
For the efficient training, we parametrize all conditional probabilities using a single Transformer model.
We process the input data $x_1, \ldots, x_T$ by several transformer layers and get embeddings $h_1, \ldots, h_T$.
We use masked attention: at each self-attention step we mask the attention matrix with a lower-triangular matrix $L$, where
$$L_{ij} = \begin{cases} 1, & i \geq j, \\ 0, & \text{otherwise}.\end{cases}$$
This means that the $i$-th embedding depends only on the previous elements of the sequence!
Thus, we can interpret those embeddings (after a linear layer) as logits for conditional probabilities.
I.e. we model conditional probabilities by trying to map $x_1, \ldots, x_T$ to $x_2, \ldots, x_{T+1}$.
In training, we predict the conditional probabilities of the next symbol, and maximize the likelihood of the total sequence.
At inference time, we can only generate tokens one by one, which is one of the bottlenecks of efficient generation.
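A sketch of the two key training ingredients, the causal mask and the shifted next-token targets, assuming PyTorch and toy sizes; the logits below are a random stand-in for the transformer outputs:

import torch
import torch.nn.functional as F

T, vocab = 8, 100
scores = torch.randn(T, T)                     # raw attention scores for one head

# Causal (lower-triangular) mask: position i may only attend to positions j <= i
L = torch.tril(torch.ones(T, T, dtype=torch.bool))
masked = scores.masked_fill(~L, float("-inf"))
attn = masked.softmax(dim=-1)                  # each row is a distribution over the past

# Training objective: predict token t+1 from the embedding at position t
tokens = torch.randint(0, vocab, (T + 1,))     # x_1, ..., x_{T+1}
logits = torch.randn(T, vocab)                 # stand-in for the transformer outputs
loss = F.cross_entropy(logits, tokens[1:])     # maximize likelihood of the shifted sequence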
GPT is a decoder-only model: it generates data but does not compute meaningful embeddings of the text.
The original Neural Machine Translation models were encoder-decoder models; now you can implement encoder-decoder models by using a trapezoidal attention mechanism.
There are also pure encoder models, the most famous being BERT.
Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only model trained in a self-supervised way.
We mask some of the tokens, and then predict them.
Another loss is next sentence prediction: the model receives two sentences and tries to predict whether the second sentence is a continuation of the first one.
[CLS] token is inserted before the first sentence and [SEP] token is inserted between two sentences.
#!pip install transformers
from transformers import pipeline

# Load a masked-language-modelling pipeline with the pretrained BERT model
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Artificial Intelligence [MASK] take over the world.")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias'] - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[{'score': 0.8617352843284607, 'token': 2515, 'token_str': 'does', 'sequence': 'artificial intelligence does not take over the world.'}, {'score': 0.06553470343351364, 'token': 2097, 'token_str': 'will', 'sequence': 'artificial intelligence will not take over the world.'}, {'score': 0.026606494560837746, 'token': 2106, 'token_str': 'did', 'sequence': 'artificial intelligence did not take over the world.'}, {'score': 0.010828700847923756, 'token': 2064, 'token_str': 'can', 'sequence': 'artificial intelligence can not take over the world.'}, {'score': 0.009001809172332287, 'token': 2071, 'token_str': 'could', 'sequence': 'artificial intelligence could not take over the world.'}]
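Returning to the special tokens: a small sketch (using the same transformers library) showing how the BERT tokenizer inserts [CLS] and [SEP] for a sentence pair:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("The cat sat on the mat.", "It was asleep.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])  # 0 for the first sentence, 1 for the second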
There are also architectures that try to combine encoder and decoder, for example BART, which combines bidirectional and autoregressive transformers:
BART is trained by corrupting the input (for example, masking or permuting tokens) and learning to reconstruct it.
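As a usage sketch (a downstream application, not the pretraining procedure itself), BART checkpoints are available through the same pipeline interface, e.g. the facebook/bart-large-cnn summarization model:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = ("The transformer architecture replaced recurrent models in most NLP tasks. "
        "It relies on self-attention to exchange information between sequence positions.")
print(summarizer(text, max_length=30, min_length=5))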
https://huggingface.co/ hosts a huge amount of NLP transformer models and many different pipelines.
Attention scales like $L^2$, where $L$ is the sequence length. For very long sequences this becomes time-consuming, so quite a lot of fast attention variants have been proposed.
For language models, the attention is not the main bottleneck.
Most of the parameters are located in feedforward layers that map $N_f \rightarrow 4 N_f$.
They need to be compressed, quantized, etc.
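A toy sketch of symmetric 8-bit weight quantization of such a layer (the weight matrix is random), illustrating the basic idea:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)).astype(np.float32)   # an N_f -> 4 N_f weight matrix

scale = np.abs(W).max() / 127.0                  # symmetric per-tensor scale
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_hat = W_int8.astype(np.float32) * scale        # dequantized approximation

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"4x smaller storage, relative error {rel_err:.4f}")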
After the recent leaks of LLaMA and related models, people have been trying to quantize them, but often do not systematically check the accuracy.