In this post, I will summarize my notes from the 2nd lecture of the Zero to Hero course by Andrej Karpathy, which concerns building makemore, a bigram character-level language model.
Count-based makemore
Makemore is a probabilistic model based on frequency counts. It aims to predict the next character based on the previous character.
Specifically, the model uses the distribution of character pairs (aka bigrams), learned from counting how often each pair appears in a dataset — in our case, one consisting of real names.
It then predicts the next character until it picks an “end” character, finalizing the generation of a name.
In a nutshell, here’s how it’s created:
1. Pair all the characters from the alphabet, plus start and end characters (special characters that indicate the beginning and the end of a name), obtaining the complete set of bigrams. For example, in the name Emma, there are 5 bigrams:
('<S>', 'e')
('e', 'm')
('m', 'm')
('m', 'a')
('a', '<E>')
2. Count the occurrence of each bigram across all names in the training set (e.g. How many times does the pair (‘e’, ‘m’) appear in our dataset? At least 2, if our data includes “Emma” and “Emily.”). The results will then be used to create an array of raw frequency counts.
3. Normalize each row of counts by that row's total, generating a matrix P that contains the probabilities for each pair. Each probability indicates how likely a character is to follow the previous one, based on the bigrams existing in the names in the training set. Furthermore, each value in P is a parameter of the model.
4. Use PyTorch's multinomial function to sample the next character according to the probabilities in the corresponding row of P (see the sketch after this list). Since the sampling is probabilistic, the model doesn't always pick the most likely character. Instead, it samples randomly from the distribution, which introduces variability in the generated names.
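Here's a minimal sketch of these four steps in PyTorch. The file name names.txt and the stoi/itos mappings are illustrative assumptions; the indexing follows this post's convention (<S> = 0, 'a'–'z' = 1–26, <E> = 27):

```python
import torch

# Assumed dataset: a "names.txt" file with one lowercase name per line.
words = open("names.txt").read().splitlines()

# Character <-> index mappings: <S> = 0, 'a'..'z' = 1..26, <E> = 27.
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
stoi["<S>"], stoi["<E>"] = 0, 27
itos = {i: ch for ch, i in stoi.items()}

# Step 2: count how often each bigram appears across all names.
N = torch.zeros((28, 28), dtype=torch.int32)
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# Step 3: normalize each row into a distribution over the next character.
P = N.float()
P = P / P.sum(dim=1, keepdim=True).clamp(min=1)  # clamp avoids 0/0 on the unused <E> row

# Step 4: sample names character by character until <E> is picked.
g = torch.Generator().manual_seed(2147483647)
for _ in range(5):
    out, ix = [], 0  # start at <S>
    while True:
        ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
        if ix == 27:  # <E> ends the name
            break
        out.append(itos[ix])
    print("".join(out))
```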
In the step-by-step, you might have noticed that matrix P is already “trained,” as it reflects the exact probabilities of each bigram in our dataset.
What if P isn’t trained? What if, instead of calculating the probabilities (model parameters) manually, we want to use an optimizable model that can take our training data and learn P — the conditional probabilities of each bigram pair — automatically?
Optimizable makemore
In this approach, we first have to define a loss function, so we can measure how well the model fits the training data (the dataset of names).
For bigram models, this loss is based on Maximum Likelihood Estimation (MLE). MLE aims to find the set of parameters that maximize the likelihood of the observed training data under the model. (Remember, the parameters are the probabilities for each bigram gathered in the matrix P!)
Given our dataset of names, the likelihood of the whole dataset is the product of the probabilities of all its bigrams. The optimization process then focuses on assigning high probabilities to likely bigrams and lower probabilities to unlikely ones, increasing the likelihood of the dataset as a whole. The output is a refined P, reflected in a minimized loss.
In practice, we use the negative log-likelihood (NLL) loss. The reason is that by using the log of the likelihood, we avoid the problem of multiplying small probabilities, and by using the negative values, we follow the standard loss semantics of “the lower, the better.”
- So, when the model assigns a high probability to a probable bigram, the negative log probability will be close to 0, indicating a good fit.
- Otherwise, the negative log probability will be a large number, indicating a poor fit.
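As a quick sketch (reusing the P and stoi objects from the snippet above), the average NLL of a single name could be computed like this:

```python
import torch

def avg_nll(name, P, stoi):
    """Average negative log-likelihood of one name under the bigram model."""
    chs = ["<S>"] + list(name) + ["<E>"]
    log_likelihood, n = 0.0, 0
    for ch1, ch2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2]])
        n += 1
    return -log_likelihood / n  # lower is better

print(avg_nll("emma", P, stoi).item())
```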
Smoothing
In both cases, there’s a problem: what if we want to calculate the probability the model will assign to an uncommon name? What if this name has a bigram that’s not present in our training dataset?
For example, consider the name “Andrej,” which contains the bigram “ej.” If this pair hasn’t appeared in the training data, its assigned probability would be zero, leading the model to assign the entire name a probability of zero. Considering log probabilities, this would result in negative infinity, causing the model to break.
To avoid this, we can apply smoothing techniques. A common approach is called Laplace smoothing: adding a small default count (e.g., 1) to all bigrams, even if they didn’t appear in the training data. This ensures that every possible bigram has a non-zero probability.
The effect of this is to make the probability distribution more uniform — unseen bigrams receive small, non-zero probabilities, while frequently observed bigrams can still stand out.
The degree of smoothing can be controlled by adjusting the default count. A higher count makes the probabilities more similar to each other, increasing the smoothness.
Attention: while this can help the model generalize better to unseen data, too much smoothing reduces its ability to differentiate between common and uncommon bigrams.
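In code, this is a one-line change to the earlier sketch: add a default count to N before normalizing (the value 1 below is the classic add-one choice; larger values mean more smoothing):

```python
# Laplace (add-one) smoothing: pretend every bigram was seen once more than it was.
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)
```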
Neural-based makemore
Cool. Now that we understand how to create a bigram model, we can try to model the same problem with a very simple neural model (why not, right?).
The process will be slightly different than what we’ve seen so far, but the overall goal will be the same (and spoiler: the results will also be the same). Here’s the step-by-step:
1. Create a training set of bigram pairs, represented by character indexes. They are usually indexed based on their alphabetical order. Each (x, y) pair consists of:
- x: the index of the current character (e.g., the start symbol <S> has index 0, as it comes first).
- y: the index of the next character (e.g., 'e' has index 5, as it's the 5th letter of the alphabet).
- For example, for the name "Emma", one of the training pairs is (x = 0, y = 5), indicating that after the start symbol <S>, the next character is 'e'.
- Here's the full (x, y) index pairing for the name "Emma":
<S> (0) → 'e' (5)
'e' (5) → 'm' (13)
'm' (13) → 'm' (13)
'm' (13) → 'a' (1)
'a' (1) → <E> (27)
2. Convert the indexes to one-hot encoded vectors. One-hot encoding transforms each character index (categorical information) into a binary vector.
- For example, with a vocabulary size of 28 (including all letters and the start/end tokens), the one-hot vector for <S> (index 0) would be [1, 0, 0, ..., 0], and the one for 'e' (index 5) would be [0, 0, 0, 0, 0, 1, ..., 0].
3. Feed the one-hot vectors into the model, which performs a matrix multiplication between X, a matrix gathering all these one-hot inputs, and a weight matrix W (X @ W). This may start to look a bit complicated, but bear with me.
- X contains the one-hot encoded characters, with one row per input character and as many columns as the vocabulary size.
- W is the weight matrix, where the rows match the vocabulary size, and the columns represent the weights of the neurons in the layer.
- The result of X @ W is a matrix representing the activations of each neuron for each input character.
Due to the one-hot encoding, each input character is actually selecting a particular row from the weight matrix W.
Remember: these weights, learned during training, aim to capture the same relationships as our original bigram probabilities.
So, multiplying the one-hot vector by W actually works like a lookup, retrieving the relevant row for the character, which indicates the frequencies of all the bigrams that start with that character.
4. Obtain the probabilities for the bigrams. The entries in the resulting matrix X @ W will actually be the equivalent of the log counts, also named logits, of each pair.
- When exponentiated, logits return (in the case of our bigram model) the equivalent of the raw counts. To obtain the probabilities, we then normalize each row so it sums to 1, just as we did with the counts before.
- For this, we use the softmax function: it applies exponentiation followed by normalization to convert logits into probabilities.
5. Adjust the probabilities (aka tune the network's weights) with backpropagation and gradient descent, based on their impact on the NLL loss. If you're not familiar with these concepts, please take a look at this post.
- This step is repeated for n training rounds. At each round, the loss (the network's error) should (ideally) decrease.
- The lower the loss, the more we know the network is assigning high probabilities to the correct next characters, maximizing the likelihood of our data under the neural model.
And that’s it! We’ve just trained a neural network to predict the next character based on a previous character. Here’s a summary in code:
- The first forward pass for a couple of bigrams in Emma:
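Something like the sketch below, following this post's 28-token indexing (the variable names are mine, not necessarily the lecture's):

```python
import torch
import torch.nn.functional as F

# Training pairs for "emma": <S>→e, e→m, m→m, m→a, a→<E>.
xs = torch.tensor([0, 5, 13, 13, 1])    # current characters
ys = torch.tensor([5, 13, 13, 1, 27])   # next characters

# Randomly initialized weights: 28 inputs, 28 "neurons" (one per possible next character).
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((28, 28), generator=g, requires_grad=True)

# Forward pass.
xenc = F.one_hot(xs, num_classes=28).float()      # (5, 28) one-hot inputs
logits = xenc @ W                                 # log-counts
counts = logits.exp()                             # equivalent of raw counts
probs = counts / counts.sum(dim=1, keepdim=True)  # softmax: each row sums to 1

# NLL loss: average negative log-probability assigned to the correct next characters.
loss = -probs[torch.arange(5), ys].log().mean()
print(loss.item())
```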
- The training loop:
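And a sketch of the loop itself, reusing words and stoi from the count-based snippet (the learning rate and number of iterations are illustrative):

```python
import torch
import torch.nn.functional as F

# Build the full training set of bigram pairs from all names.
xs, ys = [], []
for w in words:
    chs = ["<S>"] + list(w) + ["<E>"]
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs, ys = torch.tensor(xs), torch.tensor(ys)
num = xs.nelement()

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((28, 28), generator=g, requires_grad=True)

for _ in range(100):
    # Forward pass: one-hot -> logits -> probabilities -> NLL loss.
    xenc = F.one_hot(xs, num_classes=28).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean()

    # Backward pass and gradient descent step.
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad

print(loss.item())
```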
Smoothing in neural networks
But you might have a final question: what about smoothing? Can we also apply it in our neural model?
The answer is yes. The operation of adding base counts to all bigram frequencies has an equivalent in gradient-based optimization: incentivizing the weights to take similar values, close to 0.
- The closer the weights are to 0, the more the logits (“raw frequency counts”) will approach 0. When exponentiated, these values will become close to 1, making the probabilities between bigrams more uniform.
- For this, we introduce a regularization term in the loss function that penalizes large weight values, encouraging them to stay close to 0 (shown below).
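In the training loop sketched above, this amounts to adding an L2 penalty to the loss (the 0.01 strength is just an illustrative value):

```python
# NLL plus a penalty that pulls the weights toward 0 (more smoothing for larger coefficients).
loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
```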
And that’s it! How incredible, right?
Before wrapping up, a quick (awesome) aside: a while ago, researchers used a similar approach to model conditional probabilities between medical features from Electronic Health Records:
See you next time.