Published: July 8, 2025

How LLMs work, clearly explained (with visuals):

Before diving into LLMs, we must understand conditional probability. Let's consider a population of 14 individuals:
- Some of them like Tennis 🎾
- Some like Football ⚽️
- A few like both 🎾 ⚽️
- And a few like neither

Here's how it looks 👇

Image in tweet by Avi Chawla

So what is conditional probability? It's a measure of the probability of an event given that another event has occurred. If the events are A and B, we denote this as P(A|B), which reads as "probability of A given B". Check this illustration 👇

Image in tweet by Avi Chawla

For instance, if we're predicting whether it will rain today (event A), knowing that it's cloudy (event B) might impact our prediction. As it's more likely to rain when it's cloudy, we'd say the conditional probability P(A|B) is high. That's conditional probability!
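To make the definition concrete, here's a minimal sketch that computes P(Tennis | Football) from the definition P(A|B) = P(A and B) / P(B). The thread's figure isn't reproduced here, so the group sizes below are made-up numbers that simply add up to 14.

```python
from fractions import Fraction

# Hypothetical group sizes for the 14-person population
# (assumed for illustration; not taken from the thread's figure).
counts = {
    "tennis_only": 4,
    "football_only": 5,
    "both": 3,
    "neither": 2,
}
total = sum(counts.values())  # 14 people in all

# Marginal counts: everyone who likes each sport, including those who like both.
likes_tennis = counts["tennis_only"] + counts["both"]      # 7
likes_football = counts["football_only"] + counts["both"]  # 8

# P(A|B) = P(A and B) / P(B)
p_both = Fraction(counts["both"], total)
p_football = Fraction(likes_football, total)
p_tennis = Fraction(likes_tennis, total)
p_tennis_given_football = p_both / p_football

print("P(Tennis)            =", p_tennis)                 # 1/2
print("P(Tennis | Football) =", p_tennis_given_football)  # 3/8
```

With these assumed numbers, knowing that someone likes Football lowers the chance that they like Tennis from 1/2 to 3/8. The condition changes the probability, which is the whole point.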

Now, how does this apply to LLMs like GPT-4? These models are tasked with predicting/guessing the next word in a sequence. This is a question of conditional probability: given the words that have come before, what is the most likely next word?

Image in tweet by Avi Chawla

To predict the next word, the model calculates the conditional probability for each possible next word, given the previous words (context). The word with the highest conditional probability is chosen as the prediction.

Image in tweet by Avi Chawla
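Here's a toy sketch of "pick the word with the highest conditional probability". It estimates P(next word | previous two words) by counting in a tiny made-up corpus. A real LLM computes these probabilities with a neural network over a much longer context, so treat the counting approach, the corpus, and the function names as illustrative assumptions.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration only).
corpus = (
    "the boy went to the school . "
    "the boy went to the cafe . "
    "the boy went to the school ."
)
tokens = corpus.split()

# Count which word follows each two-word context.
follow_counts = defaultdict(Counter)
for first, second, nxt in zip(tokens, tokens[1:], tokens[2:]):
    follow_counts[(first, second)][nxt] += 1

def next_word_probs(context):
    """Estimate P(next word | context) from raw counts."""
    counts = follow_counts[context]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probs(("to", "the"))
greedy = max(probs, key=probs.get)  # word with the highest conditional probability

print(probs)
print("greedy prediction:", greedy)
```

In this corpus "school" follows "to the" twice and "cafe" once, so the greedy choice is "school".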

The LLM learns a high-dimensional probability distribution over sequences of words, and the parameters of this distribution are the trained weights! The training (or rather pre-training) is self-supervised: the next word in the training text itself serves as the label, so no manual annotation is needed. I'll talk about the different training steps next time! Check this 👇

Image in tweet by Avi Chawla
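One way to picture where the training signal comes from: every position in a piece of text yields a (context, next word) pair, with the label taken from the text itself. The sketch below builds those pairs for one sentence. Splitting on whitespace and the helper name `next_word_pairs` are simplifying assumptions; real models use subword tokenizers.

```python
def next_word_pairs(text):
    """Turn a sentence into (context, next word) training pairs."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs("the boy went to the school")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Pre-training adjusts the weights so that, for each context, the model assigns a high conditional probability to the word that actually came next.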

But there is a problem! If we always pick the word with the highest probability, we end up with repetitive outputs, making LLMs almost useless and stifling their creativity. This is where temperature comes into the picture. Check this before we understand more about it...👇

Image in tweet by Avi Chawla

However, a high temperature value produces gibberish output. Let's understand what's going on...👇

Image in tweet by Avi Chawla

So instead of always selecting the best token (for simplicity, let's think of tokens as words), LLMs "sample" the prediction. Even if "Token 1" has the highest score, it may not be chosen since we are sampling.

Image in tweet by Avi Chawla
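A quick sketch of the difference between greedy selection and sampling. The three tokens and their probabilities are made up for illustration, and the seed is fixed only so that reruns give the same draws.

```python
import random
from collections import Counter

# Hypothetical next-token distribution (assumed values).
tokens = ["Token 1", "Token 2", "Token 3"]
probs = [0.5, 0.3, 0.2]

# Greedy: always take the token with the highest probability.
greedy_pick = tokens[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probabilities.
rng = random.Random(0)  # fixed seed for reproducibility
samples = rng.choices(tokens, weights=probs, k=10_000)
sample_counts = Counter(samples)

print("greedy:", greedy_pick)
for token in tokens:
    print(token, sample_counts[token] / len(samples))
```

Greedy selection returns "Token 1" every time. Under sampling, "Token 1" still wins most often, but "Token 2" and "Token 3" show up roughly in proportion to their probabilities.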

Now, temperature introduces the following tweak in the softmax function, which, in turn, influences the sampling process:

Image in tweet by Avi Chawla
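The thread's figure isn't reproduced here. The temperature-scaled softmax is usually written as below, where z_i is the model's raw score (logit) for token i and T is the temperature. This is the standard form and is assumed, not confirmed, to match the figure's notation.

```latex
P(\text{token}_i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

With T = 1 this is the ordinary softmax. T < 1 sharpens the distribution toward the top-scoring token, and T > 1 flattens it.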

Let's take a code example! At low temperature, probabilities concentrate around the most likely token, resulting in nearly greedy generation. At high temperature, probabilities become more uniform, producing highly random outputs. Check this out👇

Image in tweet by Avi Chawla
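Since the original code image isn't available here, the following is a minimal stand-in sketch, not the author's code. It divides each logit by the temperature before applying softmax, using made-up logits and a hypothetical helper name `softmax_with_temperature`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits after dividing each one by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens

low = softmax_with_temperature(logits, 0.1)    # sharply peaked
mid = softmax_with_temperature(logits, 1.0)    # ordinary softmax
high = softmax_with_temperature(logits, 10.0)  # close to uniform

for name, dist in [("T=0.1", low), ("T=1.0", mid), ("T=10", high)]:
    print(name, [round(p, 3) for p in dist])
```

At T=0.1 nearly all of the probability mass lands on the first token, so sampling behaves almost like greedy selection. At T=10 the three probabilities are close to each other, so sampling becomes close to a uniform random pick.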

That's a wrap! If you found it insightful, reshare it with your network. Find me → @_avichawla Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.

@_avichawla Great accessible explanation! Thanks for sharing Avi! 🙌

@_avichawla Finally, someone explained LLMs without turning it into a PhD thesis.

@_avichawla Thus, Bayes' theorem is the quiet backbone of LLMs. It’s why your next word prediction isn't random but shaped by prior probabilities and evidence (your prompt).

@_avichawla So good.

@_avichawla Great man. I am going to cover it in my next newsletter email. https://www.cooldeep.ai/subscr...

@_avichawla I’m an AI researcher and the CEO of an AI platform called RentPrompts. Hear me out: LLMs are deep learning models that don’t “predict” anything, because predicting implies foreseeing the future. Instead, they make a “guess” at the next tokens, which we call an “answer.” A guess can

@_avichawla Thanks for the simple explanation 👍 with a mix of probability. If the boy never went to the café in the training data, the LLM will never say that, and if the test data takes him to school most of the time, then a low temperature will take the boy to school every time.

@_avichawla So LLMs always use a high temperature so that they don't keep choosing the same outputs, and low temperatures are never used? That's how I understood it. Can someone correct me if that's wrong? I am still learning, so please explain if I have misunderstood!!

@_avichawla damn, this is one of the best breakdowns of LLMs I’ve seen. Pretty intuitive

@_avichawla Very insightful breakdown of LLMs! Probabilistic modeling concepts like Bayes’ theorem are key to how they predict and learn.

@_avichawla Insightful

@_avichawla well-explained. thanks Avi!

@_avichawla Great explanation!

@_avichawla @grok can you explain the tweet above?

@_avichawla Great explanation! Now I think I can easily break down LLMs more deeply, just like this.

@_avichawla Does it mean that a low temperature would produce a better response, but at the cost of dull, repetitive answers?

@_avichawla You put your prompts to a black box. It does some magic and voila there is your answer. Nobody knows how it does what it does. MYSTERY

@_avichawla Very good explanation!

@_avichawla The technicalities are explained really well
