Byte Pair Encoding (BPE) Tokenization: a compression algorithm also used in the GPT-2 architecture
Constructing a Vocabulary
- Split the text into a word list, simply by splitting across whitespace & punctuation
- Split each word into its individual letters
- Add all letters to the vocabulary set
- Pair up every possible adjacent pair of letters across the word list and count each pair's frequency
- Add the most frequent pair to the vocabulary set as a new token
- Keep track of the merge rules
- Repeat until the vocabulary set reaches a desired size (see the sketch after this list)
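A minimal Python sketch of this loop, assuming a simple regex for the whitespace-and-punctuation split (the name `build_bpe_vocab` and the regex are illustrative, not from any particular library):

```python
import re
from collections import Counter

def build_bpe_vocab(text, target_size):
    """Build a BPE vocabulary and ordered merge rules from raw text."""
    # Split the text into a word list across whitespace & punctuation
    words = re.findall(r"\w+|[^\w\s]", text)
    # Represent each word as a tuple of individual letters
    word_counts = Counter(tuple(w) for w in words)
    # Start the vocabulary with all individual letters
    vocab = {ch for w in word_counts for ch in w}
    merge_rules = []
    while len(vocab) < target_size:
        # Count every adjacent pair, weighted by word frequency
        pairs = Counter()
        for word, count in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # The most frequent pair becomes a new token
        (a, b), _ = pairs.most_common(1)[0]
        merge_rules.append((a, b))
        vocab.add(a + b)
        # Apply the new merge inside every word
        new_counts = Counter()
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return vocab, merge_rules
```

Real implementations such as GPT-2's operate on bytes rather than letters, so any input is representable, but the loop structure is the same.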
Tokenizing a text using a pre-existing vocabulary
Go through the merge rules one by one and apply each greedily until no more merges are possible. This is enough to tokenize any word.
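A sketch of that greedy application, assuming `merge_rules` is the ordered list recorded during vocabulary construction above:

```python
def tokenize_word(word, merge_rules):
    """Tokenize one word by applying merge rules greedily, in order."""
    tokens = list(word)  # start from individual letters
    for a, b in merge_rules:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return tokens

# Example with hypothetical learned rules:
rules = [("l", "o"), ("lo", "w")]
print(tokenize_word("lower", rules))  # ['low', 'e', 'r']
```

Because the rules are replayed in the order they were learned, earlier (more frequent) merges always take precedence, which is what makes the greedy pass sufficient.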