Transformers are a type of deep learning architecture that has revolutionized the fields of natural language processing (NLP), computer vision, and more. They were introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017 and have since become a foundational model for various tasks such as text generation, translation, image recognition, and even multi-modal learning.
Key Concepts Behind Transformers
Attention Mechanism: At the core of the Transformer model is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This enables the model to capture long-range dependencies and relationships between tokens, which traditional recurrent architectures like LSTMs struggle with.
Parallelization: Unlike recurrent neural networks (RNNs) that process input sequences one step at a time, Transformers process the entire sequence simultaneously, making them highly efficient for training on large datasets.
Architecture Overview: A Transformer model typically consists of an encoder and a decoder, each made up of multiple layers. However, for tasks like text classification or language modeling, only the encoder or the decoder may be used.
- Encoder: Encodes the input sequence into a set of continuous representations. The encoder is used in tasks like text classification and feature extraction.
- Decoder: Decodes these representations into an output sequence. The decoder is primarily used in tasks like text generation or translation.
Transformer Architecture Components
Input Embeddings: Before feeding text into the model, the words or tokens are converted into vector embeddings using techniques like Word2Vec, GloVe, or a learned embedding layer.
Positional Encoding: Since the Transformer processes the input as a whole rather than sequentially, it needs a way to incorporate the order of tokens. Positional encoding adds information about the position of each token in the sequence, ensuring that the model is aware of word order.
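The sinusoidal encoding from the original paper can be sketched in a few lines of NumPy. This is a minimal illustration of the formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(...); the dimensions chosen here are arbitrary examples.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    # Frequencies decay geometrically across the embedding dimensions.
    div_terms = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)              # even dimensions
    pe[:, 1::2] = np.cos(positions * div_terms)              # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): one encoding vector per position
```

The encoding is simply added to the token embeddings, so no extra parameters are needed and the model can, in principle, extrapolate to unseen sequence lengths.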
Self-Attention Mechanism: The self-attention mechanism calculates a weighted sum of all input tokens, where the weights (attention scores) are determined by the importance of each token relative to others. This allows the model to focus on relevant parts of the input sequence. The self-attention process involves:
- Query (Q): A vector that represents the token for which attention is being calculated.
- Key (K): A vector that represents other tokens in the sequence.
- Value (V): A vector that contains the information of the tokens that need to be attended to.
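Putting the three pieces together gives scaled dot-product attention, softmax(QKᵀ/√d_k)V. Here is a minimal NumPy sketch with random toy vectors; the shapes (4 tokens, d_k = 8) are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (4, 8); each row of w sums to 1
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.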
Multi-Head Attention: Instead of computing a single attention score, the Transformer uses multiple attention heads to capture different relationships and features. Each head performs self-attention independently, and their outputs are concatenated and linearly transformed.
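A rough sketch of this head-splitting, again in NumPy: the projection matrices and sizes (d_model = 16, 4 heads) are illustrative assumptions, and real implementations batch the heads rather than looping.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split the model dimension into n_heads, attend per head, concat, project."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)        # this head's slice
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                  # per-head softmax
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo         # concat + output projection

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))                           # 5 tokens, d_model = 16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4)
print(out.shape)  # (5, 16): same shape as the input
```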
Feed-Forward Neural Network: After the attention mechanism, the output is passed through a feed-forward neural network, which consists of two linear transformations with a ReLU activation in between. This helps to further process and learn from the attended features.
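The position-wise feed-forward sub-layer is just two matrix multiplications with a ReLU in between, applied to each token independently. A minimal sketch (the d_ff = 4 × d_model ratio follows the original paper; the toy sizes are assumptions):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, per token."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 16, 64        # the paper uses d_ff = 4 * d_model
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (5, 16): expands to d_ff internally, projects back to d_model
```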
Residual Connections and Layer Normalization: To ensure stable training and better gradient flow, residual connections (or skip connections) are added around each attention and feed-forward sub-layer, followed by layer normalization.
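The original (post-norm) arrangement is LayerNorm(x + Sublayer(x)). A small sketch with a toy sublayer, omitting the learned scale and bias that real layer norm adds:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual as in the original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
y = residual_block(x, lambda t: t * 0.5)   # toy sublayer, for illustration only
print(np.allclose(y.mean(axis=-1), 0.0))   # True: each token is normalized
```

Many later models move the normalization before the sublayer ("pre-norm"), which tends to train more stably at depth.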
Encoder and Decoder Structure
Encoder: The encoder is a stack of identical layers, each consisting of:
- A multi-head self-attention mechanism.
- A feed-forward neural network.
- Residual connections and layer normalization around both components.
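The wiring of one encoder layer can be summarized in a few lines; the sublayers below are trivial stand-ins, just to show how the residual/normalization pattern wraps both components while preserving the sequence shape.

```python
import numpy as np

def encoder_layer(x, attn, ffn, norm):
    """One encoder layer: self-attention and FFN, each wrapped in a
    residual connection followed by layer normalization."""
    x = norm(x + attn(x))   # sub-layer 1: multi-head self-attention
    x = norm(x + ffn(x))    # sub-layer 2: position-wise feed-forward
    return x

# Toy stand-ins for the real sublayers, for illustration only.
norm = lambda t: (t - t.mean(-1, keepdims=True)) / (t.std(-1, keepdims=True) + 1e-5)
attn = lambda t: t * 0.5
ffn = lambda t: np.maximum(0, t) * 0.5
x = np.random.default_rng(4).normal(size=(6, 16))
out = encoder_layer(x, attn, ffn, norm)
print(out.shape)  # (6, 16): shape is preserved, so layers can be stacked
```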
Decoder: The decoder also consists of a stack of identical layers, but with a slightly different structure:
- A multi-head self-attention mechanism that only attends to earlier tokens (to maintain the autoregressive property).
- An encoder-decoder attention mechanism that attends to the encoder's output.
- A feed-forward neural network.
- Residual connections and layer normalization.
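The "attends only to earlier tokens" restriction is implemented with a causal (lower-triangular) mask: disallowed positions are set to -inf before the softmax, so they receive zero weight. A small sketch with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores, mask):
    """Set disallowed positions to -inf before the softmax."""
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # uniform scores, for illustration
w = masked_attention_weights(scores, causal_mask(4))
print(w[1].tolist())  # [0.5, 0.5, 0.0, 0.0]: token 1 sees only tokens 0 and 1
```

Because future positions get exactly zero weight, training can still run in parallel over the whole sequence while preserving the autoregressive property.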
How Transformers Work in Sequence-to-Sequence Tasks
For tasks like language translation:
- The input text is first embedded and fed into the encoder, which produces a sequence of continuous representations.
- The decoder uses these representations and generates the output sequence one token at a time, attending to both the previously generated tokens and the encoder's output.
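The generation loop above can be sketched as greedy decoding. The "model" here is a toy function of the generated prefix, standing in for a real encoder-decoder; the token IDs and stopping behavior are illustrative assumptions.

```python
import numpy as np

def greedy_decode(next_token_logits, bos_id, eos_id, max_len=10):
    """Greedy autoregressive decoding: feed the sequence so far back in,
    pick the most likely next token, stop at the end-of-sequence token."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = next_token_logits(tokens)   # (vocab_size,) scores for next token
        nxt = int(np.argmax(logits))
        tokens.append(nxt)
        if nxt == eos_id:
            break
    return tokens

# Toy "model": always predicts previous token + 1, capped at EOS (id 5).
toy = lambda toks: np.eye(6)[min(toks[-1] + 1, 5)]
print(greedy_decode(toy, bos_id=0, eos_id=5))   # [0, 1, 2, 3, 4, 5]
```

Real systems usually replace the argmax with beam search or sampling, but the token-by-token loop is the same.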
Applications of Transformers
Natural Language Processing (NLP):
- Machine Translation: Translation was the Transformer's original motivating task, and Transformer-based models (such as Google's neural machine translation systems) excel at translating text from one language to another.
- Text Summarization: Generating concise summaries of long documents.
- Sentiment Analysis: Analyzing the sentiment of text data for applications like social media monitoring.
- Question Answering: Systems that can understand and respond to questions based on a given context.
Computer Vision: Vision Transformers (ViT) have been used for image classification and object detection, where they split an image into patches and process them similarly to how text tokens are processed.
Audio Processing: Transformers are used in tasks like automatic speech recognition (ASR) and music generation by learning from sequences of audio features.
Multimodal Learning: Transformers have been extended to handle multiple data modalities, such as combining text and images for tasks like visual question answering and image captioning.
Popular Transformer Models
- BERT (Bidirectional Encoder Representations from Transformers): A pre-trained model that uses a bidirectional encoder to understand the context of words in all directions. It is commonly used for text classification, named entity recognition, and more.
- GPT (Generative Pre-trained Transformer): A model that uses a decoder-only architecture and is designed for text generation tasks. It is unidirectional and generates text in an autoregressive manner.
- T5 (Text-to-Text Transfer Transformer): A model that treats all NLP tasks as text-to-text problems, converting inputs and outputs into text sequences.
- Vision Transformer (ViT): Applies the Transformer architecture to image data by treating images as sequences of patches, similar to words in a text sequence.
- CLIP (Contrastive Language-Image Pretraining): A multimodal model that learns to associate images and text descriptions using a contrastive learning approach.
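The ViT patching step described above amounts to a reshape: a 32×32 RGB image with 16×16 patches becomes a sequence of 4 tokens, each a flattened 16·16·3 = 768-dimensional vector. A minimal sketch (ViT then applies a learned linear projection to each patch, omitted here):

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into a sequence of flattened patches,
    as in ViT. H and W must be divisible by the patch size."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)            # (nH, nW, patch, patch, C)
    return img.reshape(-1, patch * patch * C)     # (num_patches, patch_dim)

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(img, patch=16)
print(patches.shape)  # (4, 768): 4 patch tokens of dimension 16*16*3
```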
Advantages of Transformers
- Parallelization: Transformers are highly parallelizable, allowing them to be trained efficiently on large datasets using GPUs.
- Long-Range Dependencies: The self-attention mechanism captures dependencies between tokens regardless of their distance in the sequence, making Transformers effective for modeling long text or data sequences.
- Versatility: Transformers have been adapted for a wide range of tasks and modalities, from text and images to audio and multimodal applications.
Limitations of Transformers
- Computationally Expensive: Transformers require a lot of computational resources and memory, especially for long sequences, as the self-attention mechanism has a time complexity of O(n²), where n is the sequence length.
- Data-Hungry: Transformers often need large amounts of training data to achieve good performance, which can be a limitation in domains with limited labeled data.
- Overfitting: Due to their high capacity, Transformers can easily overfit if not properly regularized or trained with sufficient data.
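The quadratic cost is easy to make concrete: each attention head materializes an n × n score matrix, so per-head memory grows with the square of the sequence length (the float32 assumption and lengths below are illustrative).

```python
# Self-attention builds an n x n score matrix per head, so memory and time
# grow quadratically with sequence length n.
for n in [512, 2048, 8192]:
    scores = n * n                      # entries in one attention matrix
    mib = scores * 4 / 2**20            # float32 bytes -> MiB
    print(f"n={n:5d}: {scores:>12,} scores, {mib:8.1f} MiB per head")
```

Quadrupling the sequence length multiplies the attention matrix by 16, which is why long-context variants (sparse, linear, or sliding-window attention) exist.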
Summary
Transformers are a game-changing architecture in deep learning that use self-attention mechanisms to capture complex relationships in data. They are the backbone of many state-of-the-art models in NLP and computer vision and have set new performance benchmarks across multiple tasks. By enabling parallelization and handling long-range dependencies efficiently, Transformers have paved the way for large-scale models like GPT, BERT, and Vision Transformers.