GPT stands for Generative Pretrained Transformer. It is a type of language model that has been trained on a very large amount of text and can generate highly fluent text, which makes it applicable to a variety of tasks. “Generative” means GPT can generate text for you. “Pretrained” refers to the fact that the model is first trained on one body of data and can then be adapted to other data. And “Transformer” refers to a specific neural network architecture.

Language Models

A fundamental idea behind language models is that they measure the likelihood of text with respect to some reference text. Specifically, they provide a probability for a text sequence (like a sentence), estimated from a corpus of text. We can write this in probability notation P, which is a function that gives a value between 0 and 1:

$$P_{\text{corpus}}(\text{text\_sequence})$$

Language models are helpful for generating text. To actually generate a text, we need a specific formulation of language modeling called Causal Language Modeling. Causal language models measure the probability of the next word in a specific sequence of text. GPT is a type of causal language model. We can represent these using probability notation too as a conditional probability:

$$P_{\text{corpus}}(\text{next\_word} \mid \text{text\_sequence})$$

There are other modeling techniques as well, such as Masked Language Modeling, although GPT does not use it. Masked language models measure the probability of a word that has been masked out of a text:

$$P_{\text{corpus}}(\text{mask\_word} \mid \text{"A text sequence ... [MASK] ..."})$$

N-gram Language Models

One option for measuring this probability is to compute the maximum likelihood estimate (MLE) of the exact sequence:

$$P_{\text{corpus}}(\text{text\_sequence}) = \frac{\text{count}_{\text{corpus}}(\text{text\_sequence})}{\text{length}(\text{corpus})}$$

However, this doesn’t generalize particularly well to new text. There is a good chance that the probability is zero, i.e. the text_sequence does not appear in the corpus at all, and a small variation in the text_sequence could lead to a dramatic change in the probability.

Instead of measuring the probability of the whole text_sequence all at once, we could measure the probability of each individual word given the words before it. The chain rule of probability shows how we can decompose the probability of the sequence into a product of conditional probabilities.

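$$
\begin{aligned}
P_{\text{corpus}}(\text{word}_1, \text{word}_2, \dots, \text{word}_n)
&= P_{\text{corpus}}(\text{word}_1) \\
&\quad \cdot P_{\text{corpus}}(\text{word}_2 \mid \text{word}_1) \\
&\quad \cdot P_{\text{corpus}}(\text{word}_3 \mid \text{word}_1, \text{word}_2) \\
&\quad \cdots P_{\text{corpus}}(\text{word}_n \mid \text{word}_1, \dots, \text{word}_{n-1})
\end{aligned}
$$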

These conditional probabilities are actually in the exact same form as causal language models. Again we can use the maximum likelihood estimation (MLE) to calculate these conditional probabilities.

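$$P_{\text{corpus}}(\text{word}_n \mid \text{context}) = \frac{\text{count}_{\text{corpus}}(\text{word}_n + \text{context})}{\text{count}_{\text{corpus}}(\text{context})}$$

$$\text{context} = \text{word}_1, \dots, \text{word}_{n-1}$$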
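As a minimal sketch of this count-based estimate, the Python below counts how often a context is immediately followed by a given word. The function names and the tiny corpus are made up for illustration, and the count of the word plus its context is read here as "the context followed by the word".

```python
def count_occurrences(tokens, pattern):
    """Count how many times the list `pattern` occurs contiguously in `tokens`."""
    n = len(pattern)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern)


def mle_conditional(tokens, context, word):
    """Estimate P(word | context) as count(context followed by word) / count(context)."""
    context_count = count_occurrences(tokens, context)
    if context_count == 0:
        # The context never appears, so the ratio is undefined; report 0 here.
        return 0.0
    return count_occurrences(tokens, context + [word]) / context_count


# Hypothetical toy corpus, already split into words.
corpus = "the match burns brightly . the match ends . the fire burns".split()

# 2 of the 3 occurrences of "the" are followed by "match".
print(mle_conditional(corpus, ["the"], "match"))
# 1 of the 2 occurrences of "the match" is followed by "burns".
print(mle_conditional(corpus, ["the", "match"], "burns"))
```

With a corpus this small, most longer contexts never occur at all, which is the sparsity problem that motivates shortening the context.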

This gets harder as the context becomes longer, because the exact sequence of words in the context becomes more and more unlikely to appear in the corpus. To get around this problem, one option is to consider only the most recent words. This is called the Markov assumption. In other words, we assume the probability of a particular word given a long context is approximately equal to the probability given a short context.

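$$
\begin{aligned}
P_{\text{corpus}}(\text{word}_n \mid \text{context}_{\text{long}})
&\approx P_{\text{corpus}}(\text{word}_n \mid \text{context}_{\text{short}}) \\
&= \frac{\text{count}_{\text{corpus}}(\text{word}_n + \text{context}_{\text{short}})}{\text{count}_{\text{corpus}}(\text{context}_{\text{short}})}
\end{aligned}
$$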

This should be far easier to estimate because the sequence of words in the short context occurs much more frequently in a corpus. This formulation is called an N-gram language model, with N-1 words in the short context.
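The N-gram idea can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (whitespace-split words, no smoothing for unseen N-grams); `train_ngram` and `ngram_prob` are hypothetical names, not part of any library.

```python
from collections import Counter


def train_ngram(tokens, n):
    """Count every n-gram and the (n-1)-word context that starts it."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return ngrams, contexts


def ngram_prob(ngrams, contexts, context, word, n):
    """P(word | context), keeping only the last n-1 context words (Markov assumption)."""
    short = tuple(context[-(n - 1):]) if n > 1 else ()
    if contexts[short] == 0:
        return 0.0  # unseen short context; a real model would apply smoothing
    return ngrams[short + (word,)] / contexts[short]


corpus = "the match burns brightly . the match ends . the fire burns".split()
bigrams, bigram_contexts = train_ngram(corpus, 2)

# The long context never occurs in the corpus, but its last word "the" does.
long_context = ["yesterday", "evening", "the"]
print(ngram_prob(bigrams, bigram_contexts, long_context, "match", 2))
```

Because only the last N-1 words are looked up, a context that never appears in full can still receive a non-zero estimate.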

Evaluation of Models

You should start thinking about your evaluation technique even before you build your model. In general, you want a language model to be able to generalize to new and unseen text. To evaluate a language model, there are two main techniques:

Intrinsic evaluation: Measures how well a particular language model estimates the probability of some held-out text.
In advance, we split off some data for training and some data for evaluation.
Using the evaluation data, you can calculate perplexity to measure how surprising the text is:

$$\text{Perplexity} = \frac{1}{\sqrt[n]{\prod_{i=1}^{n} P(\text{word}_i \mid \text{word}_1, \dots, \text{word}_{i-1})}}$$
where n is the length of the sequence.
Low perplexity → the language model gave a high probability to the text. This is good (it wasn’t surprised by the text).
High perplexity → the language model gave a low probability to the text. This is bad (it was surprised by the text).

In practice, to avoid floating-point underflow, we can instead use this equivalent formulation, which adds base-2 log probabilities:
$$\text{Perplexity} = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 P(\text{word}_i \mid \text{word}_1, \dots, \text{word}_{i-1})}$$
Extrinsic evaluation: Measures how well the model performs at some other task, because people often care more about how well the language model does at other tasks than about how well it models language directly.
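To make the perplexity calculation from the intrinsic evaluation concrete, here is a small sketch of both formulations. The per-word probabilities are invented for illustration; in practice they would come from the language model being evaluated on held-out text.

```python
import math


def perplexity_product(probs):
    """1 over the n-th root of the product of the per-word probabilities."""
    n = len(probs)
    return 1.0 / (math.prod(probs) ** (1.0 / n))


def perplexity_log2(probs):
    """Equivalent form: 2 raised to the negative mean base-2 log probability."""
    n = len(probs)
    return 2.0 ** (-sum(math.log2(p) for p in probs) / n)


# Hypothetical probabilities a model assigned to each word of a held-out sentence.
probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity_product(probs), perplexity_log2(probs))

# For long sequences the raw product underflows to 0.0,
# while the sum of logs stays well within floating-point range.
long_probs = [1e-5] * 100
print(math.prod(long_probs))
print(perplexity_log2(long_probs))
```

The two functions agree on short inputs; only the log form remains usable once the product becomes too small to represent.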

Generating Text

One strategy for generation is to perform a greedy search: iteratively generate the most likely word given the previous text. Although this approach is simple and deterministic, it often generates bland or uninteresting text. An approach called beam search mitigates some of these problems. Beam search maintains the K highest-probability sequences, where K is usually somewhere between 2 and 20, but it is still deterministic. Another strategy is sampling, which randomly chooses the next word, weighted by probability. Sampling isn’t deterministic.
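The three strategies can be compared on a toy example. The sketch below assumes a made-up "model" that is just a lookup table from the previous word to next-word probabilities; `TOY_MODEL` and all function names are hypothetical, and a real language model would condition on the whole preceding text.

```python
import math
import random

# Hypothetical toy model: previous word -> probabilities of the next word.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"match": 0.5, "fire": 0.3, "end": 0.2},
    "a": {"match": 0.9, "fire": 0.1},
    "match": {"burns": 0.7, "ends": 0.3},
    "fire": {"burns": 1.0},
}


def next_probs(sequence):
    """Next-word distribution given the sequence so far (empty if nothing follows)."""
    return TOY_MODEL.get(sequence[-1], {})


def greedy(start, steps):
    """Always append the single most likely next word."""
    seq = [start]
    for _ in range(steps):
        probs = next_probs(seq)
        if not probs:
            break
        seq.append(max(probs, key=probs.get))
    return seq


def beam_search(start, steps, k):
    """Keep the k highest-probability sequences at every step."""
    beams = [(0.0, [start])]  # (log probability, sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            probs = next_probs(seq)
            if not probs:
                candidates.append((logp, seq))  # finished sequence is carried over
                continue
            for word, p in probs.items():
                candidates.append((logp + math.log(p), seq + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams


def sample(start, steps, rng):
    """Pick each next word at random, weighted by its probability."""
    seq = [start]
    for _ in range(steps):
        probs = next_probs(seq)
        if not probs:
            break
        words = list(probs)
        seq.append(rng.choices(words, weights=[probs[w] for w in words])[0])
    return seq


print(greedy("<s>", 3))
print(beam_search("<s>", 3, k=2)[0][1])
print(sample("<s>", 3, random.Random()))
```

In this toy table greedy search commits to "the" first, while beam search with K = 2 keeps "a" alive long enough to find a sequence with higher overall probability. Sampling can return a different sequence on every run.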


Computers store text as individual characters (encoded as numbers), which is not a useful representation for working with text and interpreting its meaning. We need a representation in which words with similar meanings are represented similarly. One idea is word embeddings, also called word vectors, where words with similar meanings have similar vectors. We can then treat similarity as a problem in geometry. Different systems with different purposes may use different vectors.
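One common way to treat similarity as geometry is cosine similarity, the cosine of the angle between two vectors. The sketch below uses invented 3-dimensional vectors purely for illustration; real word vectors are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word vectors, chosen by hand so that "cat" and "dog" point the same way.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(vectors["cat"], vectors["dog"]))  # high: similar meaning
print(cosine_similarity(vectors["cat"], vectors["car"]))  # lower: less similar
```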

We also need to consider context, because one word may have different meanings in different contexts. Using the same word vector for a word in all of those contexts wouldn’t really work. This leads to the idea of context vectors, or contextualized word vectors. Context vectors are very useful in language modeling for predicting which word comes next.

For a computer to actually process text, it first needs to break it down into meaningful units that it can work with. These units can be words, tokens, or even sub-tokens. Moreover, tokenizers for deep learning models sometimes include special tokens that are used for special mechanisms, for instance <|endoftext|> to mark where processing should stop at the end of a text. Once we’ve tokenized text into these tokens or sub-tokens, the goal is still to create context vectors for them, where tokens and sub-tokens that have similar meanings have similar context vectors.
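As a rough illustration of tokenization, the sketch below splits text into lowercase words, punctuation marks, and the special <|endoftext|> token, then assigns each distinct token an integer id. This is a simplified word-level scheme assumed for the example; it is not the sub-token method that GPT models actually use, and the function names are hypothetical.

```python
import re

END_OF_TEXT = "<|endoftext|>"


def tokenize(text):
    """Split text into the special token, words, and single punctuation marks."""
    # The special token is listed first so it is matched as one unit.
    pattern = re.escape(END_OF_TEXT) + r"|\w+|[^\w\s]"
    return re.findall(pattern, text.lower())


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


tokens = tokenize("The match burns brightly.<|endoftext|>")
vocab = build_vocab(tokens)
print(tokens)
print([vocab[t] for t in tokens])
```

The integer ids are what a model uses to look up the vector for each token.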

Neural Language Models

Neural language models have the same basic goal as traditional language models: calculating the probability of the next token after a sequence of tokens. The difference is that neural language models use neural networks to do so.

The input to a language model is text. After tokenizing the text and representing the tokens as word vectors, we still don’t know anything about the context. The goal of the language model is then to combine those word vectors into some useful form.

Language models are usually used to predict the next token or missing tokens. Typically the output of a language model is a score for each possible token. Those scores are often unbounded: higher scores mean more likely, lower scores less likely. To turn them into something that looks like a probability distribution, we use the softmax function.

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

After applying softmax, all the numbers are bounded between 0 and 1, and they sum to one.
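A minimal softmax can be written directly from the formula. The version below subtracts the largest score before exponentiating, a standard trick that leaves the result unchanged but avoids overflow for large scores; the example scores are made up.

```python
import math


def softmax(scores):
    """Turn unbounded scores into values between 0 and 1 that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
print(sum(probabilities))
```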

Gradient descent and backpropagation are used to learn the model weights and embeddings. The idea is:

  1. Initialize all parameters with random numbers
  2. Input data into our neural network
  3. Examine the output (which will likely be wrong)
  4. Adjust the parameters slightly so that the output is slightly less wrong

We repeat this many times, in small steps. Gradually, the weights and embeddings move towards being useful and having useful interpretations.
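The four steps above can be sketched on the smallest possible "network": a single weight w that should learn y = 3x. This is a toy under stated assumptions (one parameter, squared error, a gradient worked out by hand); in a real neural network, backpropagation computes the gradients for all parameters automatically.

```python
import random


def train(data, steps=200, learning_rate=0.05, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)                   # 1. initialize with a random number
    for _ in range(steps):
        gradient = 0.0
        for x, y in data:                        # 2. input data into the model
            prediction = w * x                   # 3. examine the output
            gradient += 2 * (prediction - y) * x # derivative of (prediction - y)^2 w.r.t. w
        w -= learning_rate * gradient / len(data)  # 4. adjust the parameter slightly
    return w


# Hypothetical training data generated from y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
print(train(data))
```

Each pass moves w a little closer to 3, which is the value that makes the output least wrong on this data.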

The Transformer Architecture

The Transformer architecture has taken over as the most successful approach for many language problems. Self-attention and tokenization are key innovations behind its success; another reason is the huge effort put into building massive corpora of text. Previous language modeling approaches worked serially (one word at a time), whereas the Transformer architecture parallelizes very well.

Neural language models typically store a lookup table of vectors, one for every token, but these vectors don’t represent the context that the token appears in. We need a mechanism that takes those uncontextualized vectors and turns them into context vectors. We do this by thinking about which words in a sentence are important for understanding a given word’s meaning. For example, the word “match” has many meanings, and which one applies depends on the context it appears in.

The         match       burns       brightly
wordvec01   wordvec02   wordvec03   wordvec04

In this case, the context “burns brightly” tells us the word “match” is probably the fire-lighting device made of wood or paper. We can create a context vector for the word “match” to represent this meaning.

First of all, we need a scoring function (also called a relevance function) that measures how useful each of the words “the”, “burns”, and “brightly” is for understanding the word “match”. Using the scoring function, we calculate the relevance of one word (say, “burns”) for the context of another word (say, “match”). The inputs to this scoring function are the word vectors that don’t have context in them yet.

relevance("the")                                          = score00  e.g. 0
relevance("match" | "the")      = G(wordvec02, wordvec01) = score01  e.g. 12.1
relevance("burns" | "match")    = G(wordvec03, wordvec02) = score02  e.g. 89.3
relevance("brightly" | "burns") = G(wordvec04, wordvec03) = score03  e.g. 0

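The relevance scores can then be normalized, for example with the softmax function, so that each lies between 0 and 1 and they add up to 1. The values below are illustrative: they correspond to dividing each raw score by the total, whereas an exact softmax of the raw scores above would put almost all of the weight on the largest one.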

score00 = 0.00
score01 = 0.12
score02 = 0.88
score03 = 0.00

The relevance scores are then used to weight the (potentially transformed) word vectors. That is, the context vector is a linear combination of the (potentially transformed) word vectors, with the relevance scores as the weights.

= 0.00 * wordvec01
+ 0.12 * wordvec02
+ 0.88 * wordvec03
+ 0.00 * wordvec04
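The whole calculation (score, normalize, weighted sum) can be sketched as follows. This is a simplified illustration: the scoring function G is assumed here to be a plain dot product, the 2-dimensional word vectors are invented, and the learned transformations that a real Transformer applies to the vectors before scoring are left out.

```python
import math


def softmax(scores):
    """Normalize scores so they lie between 0 and 1 and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_vector(target, word_vectors):
    """Weighted sum of all word vectors, weighted by their relevance to `target`."""
    scores = [dot(target, v) for v in word_vectors]  # G assumed to be a dot product
    weights = softmax(scores)
    dim = len(target)
    combined = [sum(w * v[d] for w, v in zip(weights, word_vectors)) for d in range(dim)]
    return weights, combined


# Hypothetical uncontextualized vectors for "The match burns brightly".
sentence = {
    "the": [0.1, 0.0],
    "match": [1.0, 1.0],
    "burns": [2.0, 1.5],
    "brightly": [0.5, 0.2],
}

weights, match_in_context = context_vector(sentence["match"], list(sentence.values()))
print(dict(zip(sentence, weights)))
print(match_in_context)
```

With these made-up vectors, "burns" receives the largest weight, so the context vector for "match" is pulled towards the vector for "burns".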

This is called the self-attention mechanism. Each transformer layer in the Transformer architecture uses it to look at the other words in the sentence and decide which ones are important. In addition, each transformer layer also needs to include:

  • Information about the position of each token
  • A full feed forward network to enable non-linear behavior
  • Multi-head attention (performing self-attention multiple times)

We stack multiple transformer layers to build a Transformer architecture, with lower layers likely dealing with basic syntax and higher layers dealing with more complex reasoning.

Training the Transformer involves gradient descent and backpropagation. Generally speaking, the process is:

  1. Provide input and expected output, depending on the language modeling task that we are focusing on
  2. Run input through the Transformer and compare the output to the expected output
  3. Adjust weights in neural network to get output closer to the expected output

A few examples of the language modeling tasks include:

Task                       Objective                             Example model
Causal language modeling   Predict the next word                 GPT
Masked language modeling   Predict a masked word                 BERT
Next sentence prediction   Does one sentence follow the other?   BERT
Replace token detection    Spot the corrupted word
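To show how the first two tasks turn plain text into (input, expected output) pairs, here is a small sketch. The function names are hypothetical, and the choice of which position to mask is left to the caller for simplicity, whereas masked-language-model training typically picks positions at random.

```python
def causal_examples(tokens):
    """One (context, next token) pair per position: predict each token from those before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, position, mask="[MASK]"):
    """Hide one token; the expected output is the hidden token."""
    masked = tokens[:position] + [mask] + tokens[position + 1:]
    return masked, tokens[position]


tokens = ["the", "match", "burns", "brightly"]

for context, target in causal_examples(tokens):
    print(context, "->", target)

print(masked_example(tokens, 1))
```

The causal pairs only ever see text to the left of the target, while the masked example sees the words on both sides of the gap.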

Human-created text is used to create examples for language modeling tasks. Document collections (corpora) are becoming enormous. They may contain multiple natural languages and even programming languages. However, it is difficult to collect “good text” that is free of blatantly problematic content such as hate speech.

The key difference between BERT and GPT is that GPT is trained on causal language modeling (predicting the next token), while BERT is trained on masked language modeling (looking at all of the text and figuring out what the missing token probably is).

The text given to a causal language model is known as a prompt. For example: “The capital of France is …” Some language models have also been trained to deal well with instructions and questions. For example: “Write a paragraph about the capital of France.”

Language models have some interesting abilities that are often known as few-shot or zero-shot:

Few-shot: In addition to the task description, the model sees a few examples of the task. No gradient updates are performed.
Zero-shot: The model predicts the answer given only a natural language description of the task. No gradient updates are performed.
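The difference between the two settings is only in the prompt text, which the following sketch makes concrete. The `build_prompt` helper and the "=>" layout are illustrative assumptions; any consistent format that shows the task, the examples, and the new query would serve the same purpose.

```python
def build_prompt(task_description, query, examples=()):
    """Assemble a prompt: task description, optional worked examples, then the query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Zero-shot: only the task description and the query.
zero_shot = build_prompt("Translate English to French:", "cheese")

# Few-shot: the same task, plus a few examples placed in the prompt.
few_shot = build_prompt(
    "Translate English to French:",
    "cheese",
    examples=[("sea otter", "loutre de mer"), ("cat", "chat")],
)

print(zero_shot)
print(few_shot)
```

In both cases the model's weights stay fixed; the examples influence the output only by being part of the text the model continues.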

It is important to keep in mind that language models are “parrots”: they have looked at a lot of text, but there is no reasoning behind what they produce. They are statistical machines that have become very good at modeling next-token probabilities. They don’t really understand what’s going on in the text. Often they have seen a sentence before and are regurgitating it to you.


In the context of language modeling, a hallucination is a confident response from the model that does not seem to be justified by its training data.

Language models like GPT are not trained to generate text that is necessarily true; they are trained to predict which word comes next. In other words, they are trained to generate something that is likely given the previous context, their training data, and any fine-tuning. Hallucinations may arise from the model’s attempt to find patterns in data even where none exist, as it tries to extrapolate from what it has seen. Possible causes of hallucinations are:

  • Overfitting to training data
  • Exposure to biased or misleading data
  • Complex architectures which may lead to unpredictable outputs

What language models generate often matches reality, but there is no guarantee. Some types of hallucination include:

  1. Incorrect information generation
  2. Creating fictional details
  3. Elaborating on input without factual basis
  4. Mixing and matching content from diverse sources

Hallucinations, as generated text that is not grounded in reality, can have serious consequences. Human-in-the-loop verification is probably the most immediately useful way to mitigate their impact.

For more on GPT: Generative Pretrained Transformers, please refer to the wonderful course here

I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at
