def. GPT-2 is OpenAI’s second-generation GPT Large Language Model, released publicly in 2019

Parameters

Hyperparameters

  • Context Length: the maximum number of tokens allowed in the input
    • Sequence Length ≤ Context Length; the sequence length is the actual number of tokens in a given input
  • Vocabulary Size: the number of distinct tokens in the tokenizer’s vocabulary
  • Embedding Dimension: the dimensionality of each token embedding vector
  • Head Count: number of attention heads in each attention layer
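For reference, a minimal sketch of these hyperparameters as a Python dict, filled in with GPT-2 small’s published values (the key names are just labels for these notes, not official field names):

```python
# GPT-2 "small" hyperparameters (published values); key names are just
# descriptive labels for these notes.
gpt2_small = {
    "context_length": 1024,   # max tokens per input
    "vocab_size": 50257,      # size of the BPE tokenizer's vocabulary
    "embedding_dim": 768,     # dimensionality of each token embedding
    "head_count": 12,         # attention heads per attention layer
    "layer_count": 12,        # transformer blocks (not listed above)
}
print(gpt2_small)
```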

Learnable Parameters

  • wte (Vocab Size → Embedding Dimension): lookup table from token to embedding
  • wpe (Context Length → Embedding Dimension): lookup table for positional embedding
  • γ, β (Embedding Dimension each): scale and shift in the Layer Normalization layers
  • w, b (N → N), (N): weights and biases in the linear (fully connected) layers
  • Attention
    • Query, Key, and Value projection matrices, each (Embedding Dimension → Head Dimension) per attention head
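As a rough sketch of the parameter shapes above, using GPT-2 small’s sizes (not a complete list: it omits the attention output projection, the MLP output layer, and the final LayerNorm):

```python
import numpy as np

V, T, d, h = 50257, 1024, 768, 12   # vocab size, context length, embedding dim, head count

shapes = {
    "wte":      (V, d),       # token -> embedding lookup table
    "wpe":      (T, d),       # position -> embedding lookup table
    "ln.gamma": (d,),         # LayerNorm scale
    "ln.beta":  (d,),         # LayerNorm shift
    "W_q":      (d, d),       # query projection, all h heads stacked (d // h = 64 per head)
    "W_k":      (d, d),       # key projection
    "W_v":      (d, d),       # value projection
    "mlp.w":    (d, 4 * d),   # one linear layer's weights ...
    "mlp.b":    (4 * d,),     # ... and its bias
}

for name, shape in shapes.items():
    print(f"{name:9s} {str(shape):14s} -> {int(np.prod(shape)):>10,} parameters")
```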

Word Embeddings

  1. Given a string of text, a tokenizer will split this text into chunks (tokens)
  2. The learnable embedding matrices wte and wpe map each token (and its position) to an embedding vector. These vectors, living in the embedding space, will:
  • Be closer in meaning when their cosine similarity is larger
  • Encode meaning in their differences, i.e. the direction from one vector to another captures the “difference in meaning” between the two tokens
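A minimal sketch of the token → embedding lookup and the cosine-similarity point, with random stand-in weights and made-up token ids (real GPT-2 would use its BPE tokenizer and trained wte/wpe):

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, d = 50257, 1024, 768                # GPT-2 small sizes
wte = rng.normal(size=(V, d)) * 0.02      # stand-in for the trained token embeddings
wpe = rng.normal(size=(T, d)) * 0.02      # stand-in for the trained positional embeddings

token_ids = np.array([464, 3290, 318])    # made-up ids standing in for a tokenized sentence
positions = np.arange(len(token_ids))
x = wte[token_ids] + wpe[positions]       # each row: token embedding + positional embedding

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# with trained weights, ids for related words would score noticeably higher here
print(x.shape, cosine_similarity(wte[464], wte[3290]))
```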

Attention

  1. Each token’s embedding is multiplied by the Query matrix
    1. The resulting query vector encodes a question, e.g. “are there adjectives in front of me?”
  2. Each token’s embedding is multiplied by the Key matrix
    1. The resulting vector encodes the answer, e.g. “I’m an adjective!”
  3. The dot product of each query-key pair encodes “how well the key answers the query’s question”
    1. Each column of the resulting score grid passes through a Softmax, so the attention weights for each query sum to 1
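Putting the steps together, a minimal single-head attention sketch with random stand-in weights (real GPT-2 uses multiple heads, a causal mask, and an output projection on top of this):

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_head = 5, 768, 64                          # 5 tokens, GPT-2 small embed dim, per-head dim

x   = rng.normal(size=(n, d)) * 0.02               # stand-in token embeddings
W_q = rng.normal(size=(d, d_head)) * 0.02          # query projection (one head)
W_k = rng.normal(size=(d, d_head)) * 0.02          # key projection
W_v = rng.normal(size=(d, d_head)) * 0.02          # value projection

Q, K, V = x @ W_q, x @ W_k, x @ W_v                # each token asks, answers, and carries content

scores  = (K @ Q.T) / np.sqrt(d_head)              # [i, j] = how well key i answers query j
weights = softmax(scores, axis=0)                  # softmax down each column: the attention pattern
out     = weights.T @ V                            # each token's update is a weighted mix of values

print(out.shape)   # (5, 64): one updated vector per token, for this one head
```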