How Large Language Models Actually Work: Beyond the Hype

Mohssine Kissane
Software engineer
Why This Matters
You've used ChatGPT. You've asked it questions, generated code, and maybe even had it write emails. It feels like magic. Sometimes it's brilliant. Sometimes it confidently tells you that 2 + 2 equals 5.
Understanding how LLMs work won't just satisfy your curiosity—it'll change how you use them. You'll know when to trust their output, when to double-check, and why certain prompts work better than others. You'll stop treating them like search engines and start using them like the statistical prediction machines they actually are.
What an LLM Actually Does
An LLM doesn't understand language the way you do. It doesn't think. It doesn't reason. It predicts.
Imagine you're reading a sentence: "The cat sat on the..." Your brain automatically expects "mat" or "chair" or "roof." That's essentially what an LLM does, but at massive scale with billions of parameters trained on enormous datasets.
When you type a prompt, the model breaks your text into tokens—roughly pieces of words. Then it predicts, token by token, what should come next based on patterns it learned during training. "What's the capital" → probably followed by "of" → probably followed by a country name → probably followed by "?" The model doesn't look up facts. It generates sequences of tokens that statistically tend to appear together in its training data.
That's why LLMs are so good at grammar, syntax, and style—these are patterns. It's also why they sometimes generate plausible-sounding nonsense—they're optimizing for what sounds right, not what is right.
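To make that concrete, here's a toy sketch of next-token prediction in Python. The probability table is entirely made up for illustration; a real model scores every one of roughly 100,000 possible tokens at each step instead of looking up a hand-written dictionary.

```python
# Toy next-token probabilities (made-up numbers, purely illustrative; a real model
# computes a score for every token in its vocabulary at every step).
NEXT_TOKEN_PROBS = {
    "The cat sat on the": {"mat": 0.62, "chair": 0.21, "roof": 0.09, "moon": 0.01},
    "What's the capital": {"of": 0.91, "city": 0.04, "?": 0.02},
}

def predict_next(prompt: str) -> str:
    """Pick the single most likely continuation, like greedy (temperature-0) decoding."""
    probs = NEXT_TOKEN_PROBS[prompt]
    return max(probs, key=probs.get)

print(predict_next("The cat sat on the"))  # mat
print(predict_next("What's the capital"))  # of
```

The real system repeats this loop, appending each chosen token to the prompt and predicting again, until it decides to stop.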
The Training Process: Why Size Matters
LLMs are trained in two stages. First is pre-training: the model reads billions of web pages, books, and articles, learning statistical relationships between words. It sees "Paris is the capital of" followed by "France" millions of times. It learns that "function" and "return" often appear together in code. It absorbs patterns without understanding them.
This is why bigger models are generally better. More parameters mean more capacity to memorize patterns. GPT-3 has 175 billion parameters. GPT-4 is estimated to have over a trillion. Each parameter is a tiny piece of learned knowledge—a weight that helps the model decide what word comes next.
The second stage is fine-tuning: the model is taught to follow instructions, refuse harmful requests, and format responses helpfully. This is where the model learns to be an assistant rather than just a text predictor.
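As a toy illustration of what "learning statistical relationships" means in the pre-training stage, here's a sketch that counts which word follows which in a tiny corpus. Real pre-training learns these relationships with gradient descent over a neural network rather than by counting, but the statistical flavor is the same.

```python
from collections import Counter, defaultdict

# Tiny "corpus" standing in for billions of documents.
corpus = "paris is the capital of france . berlin is the capital of germany ."
words = corpus.split()

# Count which word tends to follow which.
bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

print(bigrams["the"].most_common(1))  # [('capital', 2)] -- "capital" usually follows "the" here
print(bigrams["of"].most_common(2))   # [('france', 1), ('germany', 1)]
```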
Temperature: The Creativity Dial
Temperature is a setting that controls randomness in the model's output. In most APIs it ranges from 0 to 2, though practical use usually stays between 0 and 1.
At temperature 0, the model always picks the most likely next token. The output is effectively deterministic: ask the same question twice and you'll get nearly identical answers. This is perfect for factual queries, coding, or anything where consistency matters.
At temperature 1, the model samples from a broader range of possibilities. It might pick the third or fifth most likely token instead of always picking the first. This is useful for creative writing, brainstorming, or when you want varied responses.
At temperature 2, output becomes chaotic. The model picks from unlikely options. Responses get weird, incoherent, or completely off-topic.
Think of it like asking someone to finish a sentence. At low temperature, they give you the obvious answer. At high temperature, they throw in unexpected words just to keep it interesting. Too high, and they start speaking gibberish.
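Here's a minimal sketch of how temperature is typically applied: raw token scores (logits) are divided by the temperature before being turned into probabilities, which sharpens or flattens the distribution. The logit values below are made up for illustration.

```python
import math
import random

def sample_with_temperature(logits: dict, temperature: float) -> str:
    """Turn raw token scores into probabilities (softmax) and sample one token.
    Dividing by the temperature sharpens (<1) or flattens (>1) the distribution."""
    if temperature == 0:
        return max(logits, key=logits.get)      # greedy: always the top token
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())                 # subtract max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    tokens, weights = zip(*((tok, e / total) for tok, e in exps.items()))
    return random.choices(tokens, weights=weights)[0]

# Made-up logits for the token after "The cat sat on the"
logits = {"mat": 4.0, "chair": 2.5, "roof": 1.0, "spaceship": -1.0}
print(sample_with_temperature(logits, 0))    # always "mat"
print(sample_with_temperature(logits, 1.0))  # usually "mat", occasionally another token
print(sample_with_temperature(logits, 2.0))  # noticeably more random
```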
Context Window: The Memory Limit
LLMs have a context window—the maximum amount of text they can "see" at once. For newer GPT-4 models, it's up to 128,000 tokens (about 100,000 words). For older models, it was around 4,000 tokens.
Everything you've said in the conversation, plus the model's responses, sits in this window. When you hit the limit, the model starts "forgetting" the earliest parts of the conversation. It's not really forgetting—those tokens just fall outside the window and stop influencing predictions.
This is why long conversations sometimes lose coherence. The model might contradict something it said an hour ago because that information is no longer in its context window. It's like having a conversation with someone who can only remember the last 10 minutes.
For developers, this means: keep prompts focused. Don't stuff unnecessary information into the context. Every token counts against your limit.
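One common way to stay inside the limit is to trim the oldest messages once the conversation exceeds a token budget. Here's a rough sketch; the 4-characters-per-token heuristic is an approximation, not an exact count, so use the model's real tokenizer when precision matters.

```python
def approx_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_to_window(messages: list, budget: int) -> list:
    """Drop the oldest messages until the conversation fits the token budget,
    mirroring how early turns stop influencing the model once they fall
    outside the context window."""
    trimmed = list(messages)
    while trimmed and sum(approx_tokens(m) for m in trimmed) > budget:
        trimmed.pop(0)  # the earliest message falls out first
    return trimmed

history = ["(very long system prompt) " * 40, "Earlier question about tokens...", "Latest question?"]
print(trim_to_window(history, budget=100))  # only the most recent messages survive
```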
Tokens: Not Quite Words
When an LLM processes text, it converts it into tokens. A token isn't exactly a word. "Chatbot" might be one token, but "unhappiness" might be two: "un" and "happiness." Punctuation counts. Spaces count.
Why does this matter? Because LLMs are billed by tokens, and their context windows are measured in tokens. A 1,000-word essay might be 1,300 tokens. Understanding tokenization helps you estimate costs and manage context limits.
Tokens also explain why LLMs struggle with certain tasks. Reversing a word letter-by-letter is hard because the model doesn't see letters—it sees tokens. Counting characters is hard because "hello" is one token, not five letters.
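You can inspect tokenization yourself with OpenAI's open-source tiktoken library (assuming it's installed via `pip install tiktoken`). Exact splits depend on the encoding, so treat the output below as illustrative rather than universal.

```python
import tiktoken  # OpenAI's open-source tokenizer library

# cl100k_base is the encoding used by several recent OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Chatbot", "unhappiness", "hello world", "A 1,000-word essay..."]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]  # the text each token covers
    print(f"{text!r}: {len(token_ids)} token(s) -> {pieces}")
```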
Hallucinations: When Confidence Meets Ignorance
LLMs hallucinate—they confidently state false information. They'll cite research papers that don't exist. They'll invent statistics. They'll claim historical events happened differently.
This isn't a bug. It's fundamental to how they work. The model's job is to predict plausible text. If you ask about an obscure topic, and the model has limited training data, it fills gaps with what seems plausible based on patterns from similar topics. It doesn't know when it's guessing—it always generates text with the same confidence level.
How to handle this: Verify anything important, especially facts, numbers, and citations. Use LLMs for brainstorming, drafting, and explaining concepts you already understand. Don't use them as a replacement for research or fact-checking.
Prompt Engineering: Why It Matters
Prompts aren't questions—they're instructions that shape the model's predictions. A vague prompt gets vague results. A specific, well-structured prompt gets better output.
Compare these prompts:
- "Write about dogs" → The model has infinite directions to go
- "Write a 200-word explanation of why dogs make good pets, focusing on loyalty and companionship, for an audience of children aged 8-10" → The model knows exactly what you want
The difference isn't intelligence. It's context. The second prompt narrows the probability space the model is predicting within. It's like saying "finish this sentence, but it has to be about science fiction, use simple words, and be optimistic in tone."
Good prompts give examples, specify format, set tone, define length, and clarify audience. They work because they steer the model's statistical predictions in useful directions.
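A simple way to enforce that discipline in code is a prompt template that bakes the constraints in every time. This is a sketch, not a prescribed format: the field names are arbitrary, and the point is that every added constraint narrows the space of likely completions.

```python
def build_prompt(topic: str, audience: str, word_count: int, focus: str, tone: str) -> str:
    """Bake the constraints directly into the prompt string so no request goes out vague."""
    return (
        f"Write a {word_count}-word explanation of {topic} for {audience}. "
        f"Focus on {focus}. Keep the tone {tone}. Use short sentences and no jargon."
    )

print(build_prompt(
    topic="why dogs make good pets",
    audience="children aged 8-10",
    word_count=200,
    focus="loyalty and companionship",
    tone="warm and encouraging",
))
```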
Limitations: What LLMs Can't Do
LLMs can't reason in the traditional sense. They can't do math reliably—they've memorized common calculations and patterns, but novel arithmetic often fails. They can't access the internet in real-time (unless specifically integrated with tools to do so). They don't know today's date unless you tell them. They can't learn from your conversation—each chat starts fresh from the same pre-trained model.
They're also biased because they're trained on internet text, which contains human biases. They can be manipulated with adversarial prompts. They sometimes refuse harmless requests because their safety training is overtuned.
These aren't problems that'll be fixed in the next version. They're inherent to the architecture. Future models will be better, but they'll still be prediction machines, not thinking machines.
Your Next Step
Stop treating LLMs like search engines or oracles. Start treating them like probabilistic text generators that are very good at pattern matching.
When you need facts, verify them. When you need creativity, increase the temperature. When you need consistency, lower it. When you're hitting context limits, summarize and start fresh. When you get bad output, refine your prompt—be more specific, give examples, clarify what you want.
LLMs are powerful tools, but only if you understand what they're actually doing under the hood.
Remember: An LLM's confidence level never reflects its accuracy. It will state "2 + 2 = 5" with the same certainty it states "Paris is the capital of France." Your job is to know the difference.