I just completed DeepLearning.ai’s NLP specialization on Coursera, worked through Stanford’s CS224n NLP course, and read a bunch of journal articles. Sorting through the alphabet soup was an undertaking in itself. Since I kept going back to my notes to compare features across the different language models and to look up benchmark datasets & sources, I figured I’d pop the charts in here in case they’re helpful to others.
Transformer specs list the largest published configuration of each model: L = # layers/blocks, A = # attention heads, H = hidden dimension size. Parameter counts aren’t in the table, but they can be back-of-the-enveloped from L and H; see the sketch after the table.
year | model | description | specs |
---|---|---|---|
2013 | word2vec | Word Representations in Vector Space | |
2014 | GloVe | Global Vectors | |
2018 | GPT | Generative Pre-trained Transformer | L=12, A=12, H=768 |
2019 | GPT-2 | Unsupervised Multitask Learning | L=48, A=25, H=1600 |
2020 | GPT-3 | Few-Shot Learners | L=96, A=96, H=12288 |
2018 | BERT | Bidirectional Encoder Representations from Transformers | L=24, A=16, H=1024 |
2019 | RoBERTa | Robustly Optimized BERT | L=24, A=16, H=1024 |
2019 | T5 | Transfer Learning with Text-to-Text Transformer | L=24, A=128, H=1024 |
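The parameter counts fall out of L and H almost directly: each transformer block holds roughly 12·H² weights (4·H² for the attention projections plus 8·H² for the two feed-forward matrices), and the token embeddings add vocab_size·H. Here’s a back-of-the-envelope sketch in Python; the vocabulary sizes are my assumptions, and biases, layer norms, and position embeddings are ignored, so expect a few percent of slack:

```python
def approx_transformer_params(n_layers, hidden, vocab_size):
    """Rough parameter count for a GPT-style decoder-only transformer.

    Per block: ~4*H^2 for the Q/K/V/output projections plus ~8*H^2 for
    the feed-forward layers (4*H inner width). Token embeddings add
    vocab_size * H (usually tied with the output softmax weights).
    """
    per_block = 12 * hidden ** 2
    return n_layers * per_block + vocab_size * hidden

# Vocab sizes are assumptions: ~40k BPE vocab for GPT, 50,257 for GPT-2/3.
print(f"GPT   ~ {approx_transformer_params(12, 768, 40_000) / 1e6:.0f}M")     # commonly cited as ~117M
print(f"GPT-2 ~ {approx_transformer_params(48, 1_600, 50_257) / 1e6:.0f}M")   # reported as 1.5B
print(f"GPT-3 ~ {approx_transformer_params(96, 12_288, 50_257) / 1e9:.0f}B")  # reported as 175B
```

Note that A never shows up in the estimate: the per-head dimension is H/A, so adding heads just slices the same projection matrices more finely.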
term | expanded | notes |
---|---|---|
NLP | Natural Language Processing | |
NLU | Natural Language Understanding | |
NLI | Natural Language Inference | |
DPR | Definite Pronoun Resolution | |
AFS | Argument Facet Similarity | |
BiDAF | Bidirectional Attention Flow | |
CoVe | Contextualized Word Vectors | |
HSIC | Hilbert-Schmidt Independence Criterion | |
PMI | Pointwise Mutual Information | |
UDA | Unsupervised Data Augmentation | |
RL2 | Fast Reinforcement Learning via Slow Reinforcement Learning | |
MAML | Model-Agnostic Meta-Learning | |
WMT | Workshop on Machine Translation | |
SemEval | Workshop on Semantic Evaluation | |
BPE | Byte-Pair Encoding | Used to tokenize & build vocabulary lists; toy sketch below the table |
TF-IDF | Term Frequency–Inverse Document Frequency | Reflects how important a word is to a document within a corpus |
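On those last two: TF-IDF in its plain form is just tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is how many of them contain term t (libraries like scikit-learn layer smoothing and normalization on top). BPE is easier to see in code than in prose, so here’s a toy sketch in plain Python (no particular tokenizer library’s API, and the sample corpus is made up) that starts every word as characters and repeatedly merges the most frequent adjacent pair:

```python
from collections import Counter

def bpe_merges(corpus, num_merges=10):
    """Toy byte-pair encoding: learn merge rules from a list of words."""
    # Represent each word as a tuple of symbols; frequencies weight the pair counts.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the (weighted) corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

print(bpe_merges(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
# e.g. [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('low', 'er'), ...]
```

Real tokenizers work on bytes, keep the merge ranks around so new text can be encoded later, and handle end-of-word markers, but the merge loop above is the core idea.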