def. GPT-2 is an early Large Language Model released publicly by OpenAI (2019)
Parameters
Hyperparameters
- Context Length: the maximum number of tokens allowed in the input
- Sequence Length ≤ Context Length; the sequence length is the actual length of the text input
- Vocabulary Size: the size of the tokenizer's vocabulary
- Embedding Dimension: the number of dimensions in the word embedding
- Head Count: number of attention heads in each attention layer
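For concrete numbers, here is a sketch of these hyperparameters for the smallest released GPT-2 model (124M parameters); the key names are this note's own shorthand, not identifiers from OpenAI's code.

```python
# Hyperparameters of the smallest released GPT-2 model (124M parameters).
gpt2_small_hparams = {
    "context_length": 1024,  # max number of tokens allowed in the input
    "vocab_size": 50257,     # size of the BPE tokenizer's vocabulary
    "embedding_dim": 768,    # number of dimensions in each word embedding
    "head_count": 12,        # attention heads per attention layer
}
```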
Learnable Parameters
- wte (Vocab Size → Embedding Dimension): lookup table from token to embedding
- wpe (Context Length → Embedding Dimension): lookup table for positional embedding
- In the Normalization layers: w, b, the learnable scale and shift
- In the linear (fully connected) layers: weights (N → N) and biases (N)
- Attention
    - Query, Key, and Value matrices, each (Embedding Dim. → Head Dim.) per attention head
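A minimal sketch of these parameter shapes, assuming GPT-2 small's hyperparameters (vocab 50257, context 1024, embedding dim 768, 12 heads); the dictionary keys are this note's shorthand, not the exact names in OpenAI's released checkpoint.

```python
# Illustrative shapes only; names like "ln.w" and "attn.W_q" are this note's
# shorthand, not the exact keys in OpenAI's released checkpoint.
V, C, E, H = 50257, 1024, 768, 12   # vocab, context, embedding dim, head count
head_dim = E // H                   # 64 in GPT-2 small

param_shapes = {
    "wte": (V, E),              # token -> embedding lookup table
    "wpe": (C, E),              # position -> embedding lookup table
    "ln.w": (E,),               # normalization layer scale
    "ln.b": (E,),               # normalization layer shift
    "attn.W_q": (E, head_dim),  # query projection, one per head
    "attn.W_k": (E, head_dim),  # key projection, one per head
    "attn.W_v": (E, head_dim),  # value projection, one per head
    "linear.w": (E, E),         # a linear layer's weights (N -> N)
    "linear.b": (E,),           # a linear layer's biases (N)
}
```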
Word Embeddings
- Given a string of text, a tokenizer will split this text into chunks (tokens)
- The learnable embedding matrices wte and wpe encode these tokens (and their positions) into embedding vectors. These vectors, embedded in the embedding space, will:
- Be closer in meaning when their cosine similarity is larger
- Embed meaning in their differences, i.e. the direction between two vectors represents the "difference in meaning" between the corresponding words (see the toy example after this list)
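A toy illustration of the cosine-similarity point, using made-up vectors rather than real GPT-2 embeddings (a real check would index rows of wte); the function name is this note's own.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| |b|); closer to 1 means more similar direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v_a = rng.normal(size=768)               # stand-in for one token embedding
v_b = v_a + 0.1 * rng.normal(size=768)   # a slightly perturbed copy
print(cosine_similarity(v_a, v_b))       # near 1: similar "meaning"
print(cosine_similarity(v_a, -v_a))      # -1: opposite direction
```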
Attention
- Each token's embedding is multiplied by a Query matrix
- The resulting vector encodes the question, e.g. “are there adjectives in front of me?”
- Each token's embedding is also multiplied by a Key matrix
- The resulting vector encodes the answer, e.g. “I’m an adjective!”
- The dot product of a query-key pair (how aligned the two vectors are) encodes "how well the question is answered"
- Each column of the resulting grid of dot products passes through a Softmax function, so the attention weights in that column sum to 1 (see the sketch after this list)
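To make the grid-of-dot-products step concrete, here is a minimal sketch of one attention head's pattern in NumPy, assuming made-up token embeddings and Query/Key matrices (the function name and shapes are this note's own; GPT-2 small uses a head dimension of 64). In this layout rows index queries, so the softmax normalizes along each row (the key axis).

```python
import numpy as np

def attention_pattern(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray) -> np.ndarray:
    """One head's attention pattern: X is (sequence_length, embedding_dim),
    W_q and W_k are (embedding_dim, head_dim). Rows of the result index
    queries, so the softmax here normalizes over the key axis."""
    Q = X @ W_q                                 # each token's "question"
    K = X @ W_k                                 # each token's "answer"
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled query-key dot products
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax so each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))                   # 5 made-up token embeddings
W_q, W_k = rng.normal(size=(768, 64)), rng.normal(size=(768, 64))
print(attention_pattern(X, W_q, W_k))           # (5, 5) grid of attention weights
```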