I just completed DeepLearning.ai’s NLP specialization on Coursera, worked through the Stanford CS224n NLP course, and read a bunch of journal articles. Sorting through the alphabet soup was an undertaking in itself. Since I kept referring to my notes to compare features across the different language models and to look up benchmark datasets & sources, I figured I’d pop the charts in here in case they’re helpful to others.
Transformer specs list the maximum values for each model family: L = # layers/blocks, A = # attention heads, H = hidden dimension size.
| year | model | description | specs |
|---|---|---|---|
| 2013 | word2vec | Word Representations in Vector Space | |
| 2014 | GloVe | Global Vectors for Word Representation | |
| 2018 | GPT | Generative Pre-trained Transformer | L=12, A=12, H=768 |
| 2019 | GPT-2 | Unsupervised Multitask Learning | L=48, A=25, H=1600 |
| 2020 | GPT-3 | Few-Shot Learners | L=96, A=96, H=12288 |
| 2018 | BERT | Bidirectional Encoder Representations from Transformers | L=24, A=16, H=1024 |
| 2019 | RoBERTa | Robustly Optimized BERT Pretraining Approach | L=24, A=16, H=1024 |
| 2019 | T5 | Transfer Learning with a Unified Text-to-Text Transformer | L=24, A=128, H=1024 |
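
The table doesn’t list parameter counts, but for standard transformer blocks a common rule of thumb is that non-embedding parameters scale as roughly 12·L·H² (attention projections plus feed-forward weights per block). A quick sanity check against the published model sizes, as a rough sketch rather than an exact accounting (it ignores embeddings, biases, and layer norms):

```python
def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter estimate: ~12 * L * H^2 per model
    (4*H^2 for the attention projections + 8*H^2 for the feed-forward layers)."""
    return 12 * n_layers * d_model ** 2

# Rough check against published sizes
print(f"GPT   ~{approx_transformer_params(12, 768) / 1e6:.0f}M")    # ~85M  (reported 117M incl. embeddings)
print(f"GPT-2 ~{approx_transformer_params(48, 1600) / 1e9:.1f}B")   # ~1.5B
print(f"GPT-3 ~{approx_transformer_params(96, 12288) / 1e9:.0f}B")  # ~174B
```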
| term | expanded | notes |
|---|---|---|
| NLP | Natural Language Processing | |
| NLU | Natural Language Understanding | |
| NLI | Natural Language Inference | |
| DPR | Definite Pronoun Resolution | |
| AFS | Argument Facet Similarity | |
| BiDAF | Bidirectional Attention Flow | |
| CoVe | Contextualized Word Vectors | |
| HSIC | Hilbert-Schmidt Independence Criterion | |
| PMI | Pointwise Mutual Information | |
| UDA | Unsupervised Data Augmentation | |
| RL2 | Fast Reinforcement Learning via Slow Reinforcement Learning | |
| MAML | Model-Agnostic Meta-Learning | |
| WMT | Workshop on Machine Translation | |
| SemEval | Workshop on Semantic Evaluation | |
| BPE | Byte-Pair Encoding | Used to tokenize & build vocabulary lists (see the sketch below the table) |
| TF-IDF | Term Frequency–Inverse Document Frequency | Reflects how important a word is to a document in a corpus (worked example below) |
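
Since BPE comes up constantly in tokenizer docs, here’s a minimal sketch of the training loop as described by Sennrich et al.: repeatedly merge the most frequent adjacent symbol pair until you hit the target number of merges. Variable names are mine, not from any particular library.

```python
from collections import Counter

def bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a {word: frequency} map (toy sketch)."""
    # Start with each word as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
```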
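And since TF-IDF is the baseline everything else gets compared to, a small worked example using one common weighting (raw term frequency times log inverse document frequency; exact formulas vary by library, so treat this as illustrative):

```python
import math
from collections import Counter

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF with raw term frequency and log IDF (one common variant)."""
    tf = Counter(doc)[term] / len(doc)        # how often the term appears in this document
    df = sum(term in d for d in corpus)       # how many documents contain the term
    return tf * math.log(len(corpus) / df)    # terms rare across the corpus get boosted

docs = [["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "and", "the", "dog"]]
print(round(tf_idf("cat", docs[0], docs), 3))  # 0.135 -- "cat" is in 2 of 3 docs
print(round(tf_idf("the", docs[0], docs), 3))  # 0.0   -- "the" is in every doc, so its weight vanishes
```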