A foundation model is a large-scale AI system trained on broad data that can be adapted to many tasks. A large language model (LLM) is one type of foundation model, focused on text and language.

Source: Google Cloud Skills (YouTube channel)
Foundation models matter because they provide the base layer for applications like chatbots, translation, and search – saving time and resources compared to training models from scratch.
Tokens in Large Language Models
When we talk about large language models (LLMs), it may sound technical, but everything begins with tokens. Tokens are the smallest units of text – words, subwords, or punctuation – that LLMs use to process and generate language.
Think of tokens like LEGO bricks:
- Alone, they’re small.
- Together, they build sentences, paragraphs, even whole documents.
That’s why tokens are at the heart of natural language processing (NLP), machine learning, and modern artificial intelligence.
What is a Token?
Tokens are the building blocks of language models, and they come in different forms.
For example:
- a full word – “dog”
- a subword – “ing” (as in “play” + “ing”)
- punctuation – “?”
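Real LLM tokenizers learn subword vocabularies (for example, byte-pair encoding), but a minimal sketch of splitting text into word and punctuation tokens might look like this (the regex-based `tokenize` helper below is illustrative, not a production tokenizer):

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation tokens.
    # Real LLMs use learned subword vocabularies (e.g., BPE) instead.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Is the dog playing?"))
# ['Is', 'the', 'dog', 'playing', '?']
```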
Language Model Tokens – The Puzzle Analogy:

How Tokens Are Used
In machine learning pipelines, tokens are turned into vector embeddings – numerical representations that neural networks can process.
This allows models to perform:
- Classification
- Translation
- Sentiment analysis
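As a rough sketch, an embedding is a lookup table that maps each token id to a vector. The toy vocabulary, dimension, and random values below are stand-ins for what a real model learns during training:

```python
import random

random.seed(0)
vocab = {"dog": 0, "plays": 1, "?": 2}
dim = 4  # toy embedding size; real models use hundreds or thousands

# One vector per token id; here they are random placeholders
# for the learned values a trained model would contain.
embeddings = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(tokens):
    # Look up each token's vector by its id.
    return [embeddings[vocab[t]] for t in tokens]

vectors = embed(["dog", "plays", "?"])
print(len(vectors), len(vectors[0]))  # 3 vectors of dimension 4
```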
Token Count and Context Window
Every LLM has a maximum token count – the number of tokens it can process at once. This limit determines how much information the model can “see” in a single request.
Pros and Cons of More Tokens
More tokens mean:
- Pros: Larger inputs and bigger tasks fit in a single request
- Cons: Higher computational costs and slower responses
Context Window Limitations
The context window is like the model’s memory span. Imagine reading a novel and remembering only the last 20 pages:
- Everything inside the window – remembered
- Everything outside – forgotten
When the model loses context, it may produce hallucinations or break logical flow.
That’s why token count is a trade-off: larger windows expand capability, but also require more resources.
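The “last 20 pages” behavior can be sketched as a sliding window that keeps only the most recent tokens – a simplification of how real context limits are enforced:

```python
def fit_to_window(tokens, max_tokens):
    # Keep only the most recent tokens; everything earlier
    # falls outside the window and is "forgotten".
    return tokens[-max_tokens:]

history = ["t%d" % i for i in range(100)]
visible = fit_to_window(history, 20)
print(visible[0], visible[-1])  # t80 t99
```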
Two Ways to Handle Long Context
To handle longer tasks, models turn to two strategies:
1. Retrieval-Augmented Generation (RAG)
Pulls in external data to extend memory – keeps answers grounded in facts.
Here’s how Retrieval-Augmented Generation works, step by step:

Source: Google Cloud Skills (YouTube channel)
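A deliberately naive sketch of the retrieval step: score documents by word overlap with the query (real systems use vector similarity over embeddings), then prepend the best match to the prompt. The documents and helper names are invented for illustration:

```python
def score(query, doc):
    # Naive relevance: count of shared lowercase words.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def retrieve(query, docs, k=1):
    # Return the k most relevant documents.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

docs = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
question = "Where is the Eiffel Tower?"
context = retrieve(question, docs)

# Ground the model's answer in the retrieved document.
prompt = "Context: %s\nQuestion: %s" % (context[0], question)
print(prompt)
```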
2. Multi-step Reasoning
Breaks tasks into smaller steps – makes complex problems easier to handle.
Instead of trying to solve everything at once, the model works through the problem gradually, step by step, until it finds the solution.
The diagram below shows how a complex task is broken down into steps until reaching a final solution:

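The idea of working through a problem gradually can be sketched as applying a list of sub-steps while tracking each intermediate result – a toy decomposition, not an actual reasoning engine:

```python
def solve_stepwise(steps, value):
    # Apply each sub-step in order, recording every intermediate
    # result instead of solving the whole problem at once.
    trace = [value]
    for op in steps:
        value = op(value)
        trace.append(value)
    return value, trace

# Example task: compute (12 + 8) * 5, broken into two sub-steps.
result, trace = solve_stepwise([lambda x: x + 8, lambda x: x * 5], 12)
print(trace)  # [12, 20, 100]
```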
Next Token Prediction – Core Function
At their core, large language models work by predicting the next token.
How Prediction Works
- Input: existing tokens
- Output: the most probable next one
It’s like guessing how a friend will finish the sentence:
“I’m going to grab a cup of …” – coffee.
The following diagram illustrates how next token prediction works:

Source: Google Cloud Skills (YouTube channel)
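A toy bigram model shows the idea: count which token most often follows each token, then predict the most frequent one. Real LLMs learn these probabilities with neural networks rather than raw counts, and the tiny corpus below is made up:

```python
from collections import Counter, defaultdict

corpus = ("i am going to grab a cup of coffee . "
          "i am going to grab a cup of tea . "
          "i am going to grab a cup of coffee .").split()

# Count how often each token follows another (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    # Return the most probable next token given the previous one.
    return follows[token].most_common(1)[0][0]

print(predict_next("of"))  # coffee
```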
Role of Transformers
This is possible thanks to the transformer architecture, which uses attention to connect tokens and keep outputs clear and consistent.
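A stripped-down sketch of the attention idea: the query is scored against every key, the scores are normalized with softmax, and the output is a weighted mix of the values. Real transformers add learned projections and many parallel heads; the vectors below are toy inputs:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    # Score the query against each key (scaled dot product),
    # normalize, and return the weighted average of the values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)
print(out)  # leans toward the first value vector
```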
Randomness, Creativity, and Control
If models always picked the single most probable word, text would sound repetitive.
Temperature
- Low temperature – predictable, steady text
- High temperature – creative but risky
The diagram below shows how temperature affects text generation in language models:

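Temperature is typically applied by dividing the model's logits before the softmax. This sketch uses a tiny invented set of logits to show how low temperature sharpens the distribution and high temperature flattens it:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution (more predictable);
    # higher temperature flattens it (more creative, riskier).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print(round(cold[0], 2), round(hot[0], 2))
```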
Top-p (Nucleus Sampling)
Narrows choices to the most probable tokens, balancing focus and variety.
Instead of letting the model choose from all possible words, Top-p limits the pool to the most likely ones.
This helps keep text natural while avoiding random or irrelevant words.
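A minimal sketch of the filtering step in nucleus sampling: keep the smallest set of highest-probability tokens whose cumulative probability reaches p. The probabilities below are invented for illustration:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches p; everything else is excluded from sampling.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

probs = {"coffee": 0.6, "tea": 0.25, "juice": 0.1, "gravel": 0.05}
print(top_p_filter(probs, 0.9))  # ['coffee', 'tea', 'juice']
```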
RAG for Balance
RAG grounds answers with reliable sources, ensuring creativity doesn’t replace reliability.
In practice, this means the model can “look up” facts in a database or external documents before answering.
The result: answers that are both creative and fact-based.
Focused Conversations and System Instructions
To guide chatbots and other conversational AI systems, we use:
- System Instructions – Define tone and style (e.g., teacher, journalist, assistant).
- System Prompts – Narrow down context so answers stay consistent and relevant.
This gives AI a clear “personality” and keeps conversations on track.
This way, large language models can adapt their style, stay consistent, and provide responses that feel natural and useful across different situations.
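In practice this is often expressed as a message list. The role/content shape below mirrors common chat-completion APIs, but the exact field names vary by provider and are an assumption here:

```python
# A hypothetical message list; the system message sets tone and
# style before the user's question arrives.
messages = [
    {"role": "system",
     "content": "You are a patient teacher. Answer in plain language."},
    {"role": "user", "content": "What is a token?"},
]

def system_instruction(msgs):
    # Pull out the system message that defines the bot's "personality".
    return next(m["content"] for m in msgs if m["role"] == "system")

print(system_instruction(messages))
```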



