Byte Pair Encoding (BPE) Tokenization: a compression algorithm also used in the GPT-2 architecture
Constructing a Vocabulary
- Split the text into a word list, simply by splitting across whitespace & punctuation
- Split each word into its individual letters
- Add all letters to the vocabulary set
- Pair up every possible adjacent pair of letters across the word list and count each pair's frequency
- Add the most frequent pair to the vocabulary set as a new token
- Keep track of the merge rules
- Repeat until the vocabulary set reaches a desired size (see the sketch after this list)
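A minimal Python sketch of this loop, assuming a simple regex for the whitespace-and-punctuation split (the name `build_bpe_vocab` and the regex are illustrative, not from any particular library):

```python
import re
from collections import Counter

def build_bpe_vocab(text, target_size):
    """Build a BPE vocabulary and ordered merge rules from raw text."""
    # Split the text into a word list across whitespace & punctuation
    words = re.findall(r"\w+|[^\w\s]", text)
    # Represent each word as a tuple of individual letters
    word_counts = Counter(tuple(w) for w in words)
    # Start the vocabulary with all individual letters
    vocab = {ch for w in word_counts for ch in w}
    merge_rules = []
    while len(vocab) < target_size:
        # Count every adjacent pair, weighted by word frequency
        pairs = Counter()
        for word, count in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # The most frequent pair becomes a new token
        (a, b), _ = pairs.most_common(1)[0]
        merge_rules.append((a, b))
        vocab.add(a + b)
        # Apply the new merge inside every word
        new_counts = Counter()
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return vocab, merge_rules
```

Real implementations such as GPT-2's operate on bytes rather than letters, so any input is representable, but the loop structure is the same.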
Tokenizing a text using a pre-existing vocabulary
Go through the merge rules one by one and apply each greedily until no more merges are possible. This is enough to tokenize any word.
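A sketch of that greedy application, assuming `merge_rules` is the ordered list recorded during vocabulary construction above:

```python
def tokenize_word(word, merge_rules):
    """Tokenize one word by applying merge rules greedily, in order."""
    tokens = list(word)  # start from individual letters
    for a, b in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return tokens

# Example with hypothetical learned rules:
rules = [("l", "o"), ("lo", "w")]
print(tokenize_word("lower", rules))  # ['low', 'e', 'r']
```

Because the rules are replayed in the order they were learned, earlier (more frequent) merges always take precedence, which is what makes the greedy pass sufficient.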