The neural probabilistic language model

Elo Hoss
7 min read · Oct 10, 2024


In our previous post, we discussed count-based and simple neural bigram models. The problem with both is that they only consider the immediate past character to predict the next. This small context window hinders the predictive and generative abilities of the models.

What if we expanded the frequency matrix we’ve been calculating to consider not only bigrams (pairs of characters) but also trigrams, quadrigrams, etc.? Wouldn’t that solve the problem?

Maybe. However, this approach makes another problem emerge: the size of the matrix grows exponentially with the length of the context, making its use unfeasible from a memory point of view.
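To put rough numbers on this, here's a quick sketch of how the count table grows for the 28-character vocabulary we'll use below (just the arithmetic, nothing more):

```python
# Entries in a count table that maps a context of n characters to the next character:
# 28**n possible contexts, each with 28 columns of counts.
vocab_size = 28
for context_length in range(1, 5):
    entries = vocab_size ** context_length * vocab_size
    print(f"context of {context_length} character(s): {entries:,} counts")

# context of 1 character(s): 784 counts
# context of 2 character(s): 21,952 counts
# context of 3 character(s): 614,656 counts
# context of 4 character(s): 17,210,368 counts
```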

Today, we will talk about a more complex model for predicting the next character based on a multi-layer perceptron architecture (aka MLP), specifically the one introduced by Bengio et al. (2003).

The notes here were derived from Andrej Karpathy’s 3rd lecture on neural networks. If you’re interested, you can watch it here and use this article as a reference for a better understanding of the concepts. Let’s start.

MLP model overview

As mentioned, MLP stands for multi-layer perceptron, meaning a network based on multiple artificial neurons organized across multiple layers.

An important detail is that these neurons are fully connected: neuron 1 in layer 1, for example, is connected to all six neurons in layer 2, and so on.

On its own, each neuron only computes a linear combination of its inputs, which isn’t expressive enough for the problems we want to solve: a stack of purely linear layers can still only learn linear functions.

So, in MLPs, the neuron’s outputs are transformed by the application of nonlinear functions, such as tanh. This output is then passed as an input to the following layer, and so on.

This operation allows for modeling non-linear relationships in the data, resulting in more complex decision boundaries that enable us to solve difficult tasks such as predicting the next character.
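As a minimal sketch of what one such layer computes (the shapes here are illustrative, not yet those of the model we'll build below):

```python
import torch

x = torch.randn(32, 6)        # a batch of 32 inputs with 6 features each
W = torch.randn(6, 100)       # weights of a fully connected layer with 100 neurons
b = torch.randn(100)          # one bias per neuron

h = torch.tanh(x @ W + b)     # linear combination followed by the tanh nonlinearity
print(h.shape)                # torch.Size([32, 100]), values squashed into [-1, 1]
```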

Here’s an illustration of an MLP with three layers:

The MLP. The bias term isn’t shown in the diagram. Source: AIML.

The neural probabilistic language model

The neural probabilistic language model, introduced by Bengio et al., is an MLP. It aimed to predict the next word based on the three previous words by minimizing the negative log-likelihood (NLL) loss, the same one we’ve seen in the previous bigram models.

Their vocabulary consisted of 17,000 words. Imagine implementing a count-based bigram model for a vocabulary of this size! This issue was solved with a smarter approach based on compact word representations.

Specifically, each of the 17,000 words was initially associated with a unique, randomly initialized 30-dimensional feature vector (aka embedding). This helped avoid the curse of dimensionality that we were talking about in the intro.

These embeddings then formed the network’s inputs; the label for each training example was the next word itself.

  • For example, for the sentence “The dog was walking,” the three inputs were the 30-D vectors for “the,” “dog,” and “was,” and the label was the fourth word, “walking” (represented by its vocabulary index, not by its embedding), as in the sketch below.
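Here's a rough sketch of how such an example is assembled, using a toy four-word vocabulary and random embeddings (none of this is the paper's actual code):

```python
import torch

vocab = {"the": 0, "dog": 1, "was": 2, "walking": 3}   # toy vocabulary (the paper used 17,000 words)
C = torch.randn(len(vocab), 30)                        # randomly initialized 30-D embeddings

context = [vocab["the"], vocab["dog"], vocab["was"]]   # indexes of the three previous words
x = C[context].view(-1)                                # concatenated input vector, 3 * 30 = 90-D
y = vocab["walking"]                                   # label: the index of the next word
print(x.shape, y)                                      # torch.Size([90]) 3
```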

Then, the optimization goal of the network wasn’t only to learn the best weights and biases for the model but also to tune the embeddings with backpropagation.

As the model learned the semantic relations of concepts with training, words with similar meanings ended up close in the learned embedding space. This allowed it to predict the next word in sentences it had never seen before (out-of-distribution cases).

For example, imagine it wants to predict “the cat is running in a […],” but it never saw this phrase. However, during training, it saw “a dog was running in a room” and “the cat was walking in a room.” Instead of breaking, it can use the knowledge inherent in the embedding space to generalize to this new scenario:

  • The embeddings for “a”/“the” are similar, as they’re often used interchangeably.
  • The same goes for “cat”/“dog,” as both are animals and frequently appear in similar contexts.

Based on this, the model can infer that the next word is probably “room.”

Here’s a picture to illustrate a trained embedding space. You can notice animals are embedded close together, while fruits and brands have their own separate clusters.

An example of a trained embedding space. Source: Multimodal Embeddings Gcp Overview.

Building an MLP for predicting characters

Nice! Let’s now consolidate the MLP knowledge, adapting the architecture to our character-prediction case. The dataset creation will follow the same idea from the bigram models.

First, we load the same dataset of names and map each one of the 28 characters (alphabet + start and end characters) to their respective indexes. Here’s the process in code:

Creating the “three preceding characters-next character” training pairs for the name “Emma.” This is repeated for the other names in the dataset.
The inputs X and label Y. Each input in X consists of three character indexes, and each output in Y is the character index that comes after in the real name.
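The code screenshots aren't reproduced here, but a minimal sketch of this dataset construction (following Karpathy's approach; I use explicit <S>/<E> markers to match the 28-symbol vocabulary described above, and names like block_size are mine) could look like this:

```python
import torch

words = ["emma", "olivia", "ava", "isabella"]          # a few names from the dataset

chars = sorted(set("".join(words)))                    # the alphabet (26 letters in the full dataset)
stoi = {s: i for i, s in enumerate(chars)}             # character -> index
stoi["<S>"] = len(stoi)                                # start marker
stoi["<E>"] = len(stoi)                                # end marker
itos = {i: s for s, i in stoi.items()}                 # index -> character

block_size = 3                                         # three preceding characters as context
X, Y = [], []
for w in words:
    context = [stoi["<S>"]] * block_size               # pad the beginning with start markers
    for ch in list(w) + ["<E>"]:
        X.append(context)                              # input: the current 3-character context
        Y.append(stoi[ch])                             # label: the character that follows
        context = context[1:] + [stoi[ch]]             # slide the window one character forward

X = torch.tensor(X)                                    # shape (num_examples, 3)
Y = torch.tensor(Y)                                    # shape (num_examples,)
print(X.shape, Y.shape)
```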

Now, let’s start dissecting the MLP architecture by untangling the figure from the original article. Our model has three layers:

The neural probabilistic language model architecture. Source: Anri Lombard’s GitHub.
  1. Input layer

The blue boxes represent the three inputs of the network. They’re the one-hot encoded indexes of each of the previous three characters (if you’re not familiar with one-hot, look at this post).

Instead of processing a single example at a time, we will use batches of 32 examples, so this input is a tensor X of shape 32x3x28.

  • Each of the 32 examples has 3 characters, and each character has an associated 28-D one-hot encoding.

Matrix C is a tunable parameter and gathers the embeddings of all of the 28 characters. In our example, these will initially be 2-dimensional, so C is 28x2.

When we multiply the one-hot vectors by C, we perform a lookup operation, plucking out the respective embeddings of the characters. These three embeddings are then concatenated and passed as a single input vector to the hidden layer, as sketched in code below.

  • For example, if we have three embeddings of size 2 each, the concatenated input vector has size 6 (3 * 2 = 6).
  • If the input batch contains 32 sequences, the input tensor shape will be (32, 6), where each row represents a sequence of 3 character embeddings for each example.
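In code, a sketch of this step might look like the following (multiplying the one-hot vectors by C is mathematically the same as indexing into C, so the lookup is usually written as plain indexing):

```python
import torch

C = torch.randn(28, 2)               # embedding table: 28 characters, 2-D embeddings each
Xb = torch.randint(0, 28, (32, 3))   # a batch of 32 examples, 3 character indexes per example

emb = C[Xb]                          # the lookup: equivalent to one_hot(Xb) @ C
print(emb.shape)                     # torch.Size([32, 3, 2])

inputs = emb.view(32, 6)             # concatenate the three 2-D embeddings of each example
print(inputs.shape)                  # torch.Size([32, 6])
```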

  2. Hidden layer

In our model, we will have one hidden layer with randomly initialized weights and bias (W1, b1).

  • W1’s size is determined by the number of input features (6, since we are concatenating three character embeddings of size 2 each) and the number of neurons in the hidden layer (e.g., 100 neurons), having a shape of (6, 100).
  • b1's size will be equal to the number of neurons, which is (100,).

First, the input data, which has a shape of (32, 6), is multiplied by the weight matrix W1.

Then, the bias b1 is added to each output, generating an intermediate result of size (32, 100), which gives the 100 outputs for each of the 32 examples in the batch.

These outputs are then passed through a non-linear activation function like tanh, squashing the outputs to the [-1, 1] range.
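Here's a sketch of the hidden-layer computation with the shapes discussed above (the `inputs` tensor stands in for the concatenated embeddings from the previous step):

```python
import torch

inputs = torch.randn(32, 6)          # stand-in for the concatenated embeddings, shape (32, 6)
W1 = torch.randn(6, 100)             # 6 input features -> 100 hidden neurons
b1 = torch.randn(100)                # one bias per neuron

h = torch.tanh(inputs @ W1 + b1)     # (32, 6) @ (6, 100) + (100,) broadcast -> (32, 100)
print(h.shape)                       # torch.Size([32, 100]), values squashed into [-1, 1]
```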

  3. Output layer

The output layer takes as input the activations from the hidden layer and outputs the probabilities for each of the 28 characters in the dataset. Similar to the hidden layer, it has its own weight matrix W2 and bias vector b2.

  • W2 connects the 100 activations from the hidden layer to the 28 possible characters, having a shape of (100, 28).
  • b2 has a shape of (28,).

The activations from the hidden layer are multiplied by the weight matrix W2 and added to the bias vector b2. This produces a matrix of shape (32, 28), where each row contains the raw scores (logits) for the 28 possible output characters.

To convert the 28 logits into probabilities, we apply the softmax function, which exponentiates and normalizes them so that they sum to 1. This results in a matrix of probabilities with shape (32, 28).

For each example in the batch, we now have a set of probabilities that indicate how likely each character is to be the correct next character, given the input sequence.
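A sketch of the output layer and softmax, again with stand-in activations of the right shape:

```python
import torch

h = torch.randn(32, 100)                 # stand-in for the hidden-layer activations
W2 = torch.randn(100, 28)                # 100 activations -> 28 character scores
b2 = torch.randn(28)

logits = h @ W2 + b2                     # raw scores (logits), shape (32, 28)
probs = torch.softmax(logits, dim=1)     # exponentiate and normalize each row to sum to 1
print(probs.shape, probs[0].sum())       # torch.Size([32, 28]) tensor(1.0000)
```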

Parameter adjustment

We know the labels (the correct next character) for each of the 32 examples. Using the indexes in Y, we pluck out the probabilities assigned by the model to each respective correct character from the 28 probabilities.

The loss function we’ve chosen, the negative log-likelihood, looks at the probability the model assigned to the correct character. The goal is to maximize this probability: the NLL loss penalizes low probabilities for the correct prediction, so by minimizing the loss we maximize the likelihood of the correct output under the model.

We then compute the gradients of the loss with respect to the model’s parameters and use backpropagation to update them (more about backprop here). This process adjusts the weights and biases in a direction that reduces the loss over training iterations.
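A sketch of the loss computation, with random stand-ins for the probabilities and labels (in practice, `torch.nn.functional.cross_entropy(logits, Y)` computes the same quantity directly from the logits and is more efficient and numerically stable):

```python
import torch

probs = torch.rand(32, 28)
probs = probs / probs.sum(1, keepdim=True)       # stand-in for the softmax output
Y = torch.randint(0, 28, (32,))                  # the correct next-character indexes

# pluck out the probability assigned to each correct character and average the negative log
loss = -probs[torch.arange(32), Y].log().mean()
print(loss.item())
```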

A summary in code

Calculating the first forward pass. The total loss is equal to 17.76, which is pretty bad, as expected in the initial stages of training.
The training loop with the loss at each step. With time, it’s minimized, meaning the model maximizes the probabilities of the correct next characters. Beautiful!
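Since those screenshots aren't reproduced here, here is a condensed sketch of the forward pass and training loop (random tensors stand in for the real X and Y built earlier; the hyperparameters are illustrative, not the post's exact values):

```python
import torch
import torch.nn.functional as F

# stand-ins for the real dataset (indexes of 3-character contexts and their next characters)
X = torch.randint(0, 28, (1000, 3))
Y = torch.randint(0, 28, (1000,))

g = torch.Generator().manual_seed(42)
C  = torch.randn((28, 2),   generator=g, requires_grad=True)   # embedding table
W1 = torch.randn((6, 100),  generator=g, requires_grad=True)   # hidden layer
b1 = torch.randn(100,       generator=g, requires_grad=True)
W2 = torch.randn((100, 28), generator=g, requires_grad=True)   # output layer
b2 = torch.randn(28,        generator=g, requires_grad=True)
parameters = [C, W1, b1, W2, b2]

for step in range(10000):
    ix = torch.randint(0, X.shape[0], (32,))          # sample a batch of 32 examples
    emb = C[X[ix]]                                     # (32, 3, 2)
    h = torch.tanh(emb.view(-1, 6) @ W1 + b1)          # (32, 100)
    logits = h @ W2 + b2                               # (32, 28)
    loss = F.cross_entropy(logits, Y[ix])              # NLL of the correct next characters

    for p in parameters:                               # zero the gradients
        p.grad = None
    loss.backward()                                    # backpropagation
    for p in parameters:                               # simple gradient-descent update
        p.data += -0.1 * p.grad
```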

After some training rounds, the 2-D embedding space starts to look nice. Similar letters, like vowels, are grouped together, while less frequent letters and special marks are more spread out.
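A sketch of how such a plot can be drawn with matplotlib, assuming the embedding table `C` and the index-to-character map `itos` from the sketches above:

```python
import matplotlib.pyplot as plt

emb = C.detach()                              # the trained 2-D character embeddings, shape (28, 2)
plt.figure(figsize=(6, 6))
plt.scatter(emb[:, 0], emb[:, 1], s=200)      # one dot per character
for i in range(emb.shape[0]):
    plt.text(emb[i, 0].item(), emb[i, 1].item(), itos[i],
             ha="center", va="center", color="white")
plt.grid(True)
plt.show()
```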

And that’s it! We’ve just trained an MLP for predicting the next character.

We can do some cool things with it, like generating new names. They’re not perfect yet, but they’re definitely better than the ones generated by the simple bigram models:

Generating names with the trained MLP model.
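Here's a sketch of the sampling loop that produces names like these, reusing the parameters and character maps from the earlier sketches:

```python
import torch

g = torch.Generator().manual_seed(2147483647)
for _ in range(5):
    out = []
    context = [stoi["<S>"]] * 3                       # start with three start markers
    while True:
        emb = C[torch.tensor([context])]              # (1, 3, 2)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)     # (1, 100)
        logits = h @ W2 + b2                          # (1, 28)
        probs = torch.softmax(logits, dim=1)          # next-character distribution
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        if ix == stoi["<E>"]:                         # stop when the end marker is sampled
            break
        out.append(itos[ix])
        context = context[1:] + [ix]                  # slide the context window
    print("".join(out))
```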

See you next time!
