Introduction to Transformers
The Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017), revolutionized natural language processing (NLP) by moving away from recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for sequence transduction tasks. Its core innovation lies in the self-attention mechanism, which allows the model to weigh the importance of different words in an input sequence when processing each word.
Key Takeaway
Transformers excel at capturing long-range dependencies in sequential data without the sequential processing limitations of RNNs.
The Architecture: Encoder-Decoder
The original Transformer model consists of an encoder and a decoder. Both are stacks of identical layers (the original paper uses a stack of six layers for each).
The Encoder
The encoder's job is to process the input sequence and generate a rich representation. Each encoder layer has two main sub-layers:
- Multi-Head Self-Attention: This is where the magic happens. It allows the model to jointly attend to information from different representation subspaces at different positions.
- Position-wise Feed-Forward Network: A simple, fully connected feed-forward network applied to each position independently.
Residual connections and layer normalization are used around each of the two sub-layers.
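To make the sub-layer structure concrete, here is a minimal PyTorch sketch of one encoder layer. It is an illustration rather than a reference implementation; the hyperparameters (d_model = 512, d_ff = 2048, 8 heads) follow the base configuration of the original paper, and the post-norm arrangement (normalize after the residual addition) mirrors the description above.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(          # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention with a residual connection,
        # followed by layer normalization.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward network, same residual + normalization pattern.
        x = self.norm2(x + self.ff(x))
        return x

x = torch.randn(2, 5, 512)  # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)  # torch.Size([2, 5, 512])
```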
The Decoder
The decoder's job is to generate the output sequence, one token at a time. Each decoder layer has three sub-layers:
- Masked Multi-Head Self-Attention: Similar to the encoder's self-attention, but masked so that each position cannot attend to future positions in the output sequence, which preserves autoregressive generation (a sketch of such a mask follows below).
- Multi-Head Attention over Encoder Output: This allows the decoder to attend to the output of the encoder.
- Position-wise Feed-Forward Network: Identical to the encoder's feed-forward network.
Again, residual connections and layer normalization are employed.
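The masking in the first decoder sub-layer is commonly realized as an upper-triangular boolean mask applied to the attention scores. The sketch below is one such implementation choice (not spelled out in the text above); a mask like this can be passed to PyTorch's nn.MultiheadAttention via its attn_mask argument.

```python
import torch

def causal_mask(size):
    # True marks positions that must NOT be attended to: position i is
    # blocked from seeing positions j > i (the "future" tokens).
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```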
The Heart of the Matter: Self-Attention
Self-attention computes a weighted sum of values, where the weight assigned to each value is determined by the similarity between a query vector and a key vector.
For each word, we create three vectors, a Query (Q), a Key (K), and a Value (V), by multiplying the word's embedding with learned weight matrices. The attention output is then calculated as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Here, d_k is the dimension of the key vectors. The scaling factor sqrt(d_k) is used to prevent the dot products from becoming too large, which could lead to very small gradients after the softmax function.
Intuition
Think of it as looking up information. The Query asks a question. The Keys represent what information is available. The Values are the actual information. The attention mechanism finds the best matching Keys for the Query and returns a weighted sum of the corresponding Values.
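The formula maps directly onto a few lines of code. Below is a minimal NumPy sketch for a single sequence, with no batching or masking; the toy shapes are made up for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V

# Toy example: 3 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```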
Multi-Head Attention
Instead of performing a single attention function, multi-head attention linearly projects the queries, keys, and values h times with different learned linear projections. This allows the model to jointly attend to information from different representation subspaces at different positions. The outputs of these h attention heads are then concatenated and linearly projected once more.
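As a sketch, the project-split-concatenate pattern looks like this in NumPy. The head count and dimensions are illustrative choices, and a real implementation would compute all heads with a single batched matrix multiply rather than a Python loop.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project the input, then split each projection into num_heads subspaces.
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head)
    V = (X @ W_v).reshape(seq_len, num_heads, d_head)
    heads = []
    for h in range(num_heads):
        # Scaled dot-product attention inside head h's subspace.
        scores = Q[:, h] @ K[:, h].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, h])
    # Concatenate the head outputs and apply the final output projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: 3 tokens, d_model = 8, 2 heads
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (3, 8)
```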
Positional Encoding
Since self-attention contains no recurrence or convolution, the model has no built-in notion of token order and needs a way to incorporate information about the relative or absolute position of tokens. This is done through positional encodings, which are added to the input embeddings. These encodings are typically fixed sinusoidal functions of different frequencies, a choice that lets the model learn to attend by relative position.
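A minimal NumPy sketch of these sinusoidal encodings, following the sin/cos formulation from the original paper (d_model is assumed to be even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    # Each pair of dimensions uses a sinusoid with a different frequency.
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# The encodings are simply added to the token embeddings:
# x = token_embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(50, 16).shape)  # (50, 16)
```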
Why Transformers are Powerful
- Parallelization: Unlike RNNs, which process sequences step-by-step, self-attention can be computed in parallel for all positions, leading to significant speedups during training.
- Long-Range Dependencies: The direct connections between any two words in the sequence allow Transformers to capture dependencies between distant words more effectively than RNNs.
- Contextual Embeddings: The self-attention mechanism inherently produces contextualized embeddings, where the representation of a word depends on its surrounding context.
Applications
Transformers have achieved state-of-the-art results across a wide range of NLP tasks, including:
- Machine Translation
- Text Summarization
- Question Answering
- Text Generation (e.g., GPT models)
- Sentiment Analysis
Further Reading
Explore these resources to dive deeper: