🔍 Objective
The paper introduces BERT (Bidirectional Encoder Representations from Transformers), a language representation model designed to pre-train deep bidirectional representations from unlabeled text and fine-tune them for various NLP tasks.
🧠 Core Innovations
- Masked Language Modeling (MLM): instead of traditional left-to-right or right-to-left training, BERT masks random words and learns to predict them using context from both directions.
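As a rough illustration (not the paper's actual code), the masking scheme the paper describes — select ~15% of positions as prediction targets, then replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged — can be sketched in plain Python. The toy `VOCAB` and sentence here are made up for the example:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "sun", "tree", "river"]  # toy vocabulary for the sketch

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style MLM masking: pick ~15% of positions as targets;
    of those, 80% become [MASK], 10% a random token, 10% stay
    unchanged. Returns (inputs, labels); labels is None wherever
    the model is not asked to predict."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # model must recover the original
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)  # 10%: random token
            # else 10%: keep the original token (reduces train/test mismatch)
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, rng=random.Random(0))
```

Keeping 10% of targets unchanged is what lets the model handle real (unmasked) tokens at fine-tuning time, when [MASK] never appears.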
- Next Sentence Prediction (NSP): helps the model understand sentence relationships, i.e., whether a given sentence B actually follows sentence A.
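Building NSP training pairs is simple in outline: 50% of the time keep the true next sentence (label IsNext), otherwise substitute a random one (label NotNext). A minimal sketch, with a made-up document; a full implementation would sample the replacement from the whole corpus and avoid accidentally picking the true next sentence:

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build Next Sentence Prediction examples from consecutive
    sentences: 50% true continuations (IsNext), 50% random
    substitutes (NotNext)."""
    rng = rng or random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # sketch: in BERT this is drawn from a different document
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

doc = ["He went to the store.", "He bought milk.",
       "Then he went home.", "It started to rain."]
pairs = make_nsp_pairs(doc, rng=random.Random(1))
```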
🏗️ Architecture
- Based on the Transformer encoder (Vaswani et al., 2017).
- Two versions:
  - BERT-Base: 12 layers, hidden size 768, 12 attention heads (110M parameters).
  - BERT-Large: 24 layers, hidden size 1024, 16 attention heads (340M parameters).
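The reported parameter counts can be sanity-checked with back-of-the-envelope arithmetic (weight matrices only; biases and LayerNorm add roughly another 1%, and the vocabulary/sequence sizes below are BERT's actual values, 30,522 WordPieces and 512 positions):

```python
def bert_params(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Rough weight count for a BERT-style Transformer encoder."""
    embed = (vocab + max_pos + segments) * hidden  # token + position + segment embeddings
    attn = 4 * hidden * hidden                     # Q, K, V and output projections
    ffn = 2 * hidden * (4 * hidden)                # two feed-forward matrices, inner size 4H
    return embed + layers * (attn + ffn)

base = bert_params(12, 768)     # ~109M, matching the reported 110M
large = bert_params(24, 1024)   # ~334M, which the paper rounds to 340M
```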
🛠️ Training & Fine-tuning
- Pre-trained on:
  - BooksCorpus (800M words)
  - English Wikipedia (2,500M words)
- Fine-tuned by adding minimal output layers for specific NLP tasks (e.g., classification, QA, NER).
- All model parameters are fine-tuned end-to-end.
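For classification tasks, the "minimal output layer" is just a linear map plus softmax over the final [CLS] embedding. A pure-Python sketch with toy sizes (the vector and weights here are random stand-ins, not real model state; actual BERT uses hidden size 768 or 1024):

```python
import math
import random

def classify_from_cls(cls_vec, W, b):
    """Minimal task head: logits = W @ cls_vec + b, then softmax.
    During fine-tuning, W and b are learned AND all encoder weights
    are updated end-to-end."""
    logits = [sum(w * x for w, x in zip(row, cls_vec)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
hidden, num_labels = 8, 3                              # toy sizes for the sketch
cls_vec = [rng.uniform(-1, 1) for _ in range(hidden)]  # stand-in for the [CLS] embedding
W = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(num_labels)]
b = [0.0] * num_labels
probs = classify_from_cls(cls_vec, W, b)
```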
📊 Performance Highlights
BERT achieved state-of-the-art results on:
- GLUE benchmark (e.g., MNLI, QQP, QNLI)
- SQuAD v1.1 & v2.0 (question answering)
- SWAG (commonsense inference)
- Often outperformed previous models like OpenAI GPT and ELMo by large margins.
🔬 Ablation Studies
- Bidirectionality and NSP were crucial for high performance.
- Larger models (BERT-Large) consistently performed better, especially on smaller datasets.
- BERT also proved effective in feature-based settings, though fine-tuning offered better results.
🧩 Impact
- Shifted the NLP paradigm from task-specific models to pre-trained Transformer models with fine-tuning.
- Enabled strong performance even with little labeled data.
- Influenced a new wave of models (e.g., RoBERTa, ALBERT, DistilBERT).