Byte Pair Encoding Tokenization - YouTube

A compression algorithm that is also used for tokenization in the GPT-2 architecture.

Constructing a Vocabulary

  1. Split the text into a word list, which simply means splitting across whitespace & punctuation
  2. Split each word into individual letters
    1. Add all letters to the vocabulary set
  3. Pair up every possible adjacent pair of letters in the words. Observe each pair's frequency
    1. Add the most frequent pair to the vocabulary set as a token.
    2. Keep track of the merge rules
  4. Repeat until the vocabulary set reaches a desired size.
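The steps above can be sketched in Python. This is a minimal toy trainer, not the actual GPT-2 implementation; the corpus and `vocab_size` are made-up examples, and ties in pair frequency are broken arbitrarily.

```python
from collections import Counter

def train_bpe(words, vocab_size):
    """Toy BPE trainer. `words` is a list of strings, assumed to be
    already split on whitespace & punctuation (step 1)."""
    # Step 2: split each word into individual letters
    splits = [list(w) for w in words]
    vocab = {ch for w in splits for ch in w}  # step 2.1: all letters
    merges = []                               # step 3.2: merge rules
    while len(vocab) < vocab_size:            # step 4: repeat to size
        # Step 3: count every adjacent pair across all words
        pairs = Counter()
        for w in splits:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair
        merges.append(best)
        vocab.add(best[0] + best[1])          # step 3.1: new token
        # Apply the new merge everywhere before the next round
        new_splits = []
        for w in splits:
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged.append(w[i] + w[i + 1])
                    i += 2
                else:
                    merged.append(w[i])
                    i += 1
            new_splits.append(merged)
        splits = new_splits
    return vocab, merges

vocab, merges = train_bpe(["low", "low", "lower", "lowest"], vocab_size=9)
print(merges)
```

On this tiny corpus the trainer first merges the most frequent letter pair, then continues building longer tokens from already-merged pieces.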

Tokenizing a text using a pre-existing vocabulary

Go through the merge rules one by one, in the order they were learned, and apply each greedily until you can't anymore. This is enough to tokenize the word.
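A minimal sketch of that procedure, assuming a list of merge rules in training order (the example rules below are made up, not GPT-2's actual merges):

```python
def tokenize(word, merges):
    """Apply each merge rule in training order, greedily, to one word."""
    tokens = list(word)                      # start from individual letters
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]    # merge in place, recheck at i
            else:
                i += 1
    return tokens

# Hypothetical merge rules, as produced by a training run like the one above
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(tokenize("lowering", merges))
```

Note that the rules are applied in training order, not by scanning for the longest match; that ordering is what makes the greedy pass deterministic.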