What Is a Foundation Model (LLM)?

A foundation model is a large-scale AI system trained on broad data that can be adapted to many tasks. A large language model (LLM) is one type of foundation model, focused on text and language.

Diagram showing large language models (LLMs) as a subset of deep learning within machine learning.

Source: Google Cloud Skills (YouTube channel)

Foundation models matter because they provide the base layer for applications like chatbots, translation, and search – saving time and resources compared to training models from scratch.

Tokens in Large Language Models

When we talk about large language models (LLMs), it may sound technical, but everything begins with tokens. Tokens are the smallest units of text – words, subwords, or punctuation – that LLMs use to process and generate language.

Think of tokens like LEGO bricks:

  • Alone, they’re small.
  • Together, they build sentences, paragraphs, even whole documents.

That’s why tokens are at the heart of natural language processing (NLP), machine learning, and modern artificial intelligence.

What is a Token?

Tokens are the building blocks of language models, and they come in different forms.

For example:

  • a full word – “dog”
  • a subword – “play” + “ing” (from “playing”)
  • punctuation – “?”
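A minimal sketch of this idea in Python: the regex below only separates words from punctuation, while real LLM tokenizers (such as BPE tokenizers) also split rare words into subwords.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    Toy version: real tokenizers (e.g. BPE) also break rare
    words into subword pieces like "play" + "ing".
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Is the dog playing?")
# ['Is', 'the', 'dog', 'playing', '?']
```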

Language Model Tokens – The Puzzle Analogy:

Puzzle analogy showing tokens in a large language model, with words and subwords as separate pieces.

How Tokens Are Used

In machine learning pipelines, tokens are turned into vector embeddings – mathematical forms that neural networks can process.

This allows models to perform:

  1. Classification
  2. Translation
  3. Sentiment analysis
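As a rough sketch of the embedding step, assume a tiny fixed vocabulary and random vectors; in a trained model the vector values are learned, not random.

```python
import random

random.seed(0)
vocab = {"dog": 0, "playing": 1, "?": 2}   # toy vocabulary: token -> index
dim = 4

# One vector per token. Here the values are random placeholders;
# a real model learns them during training.
embeddings = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(token):
    """Look up the vector a neural network would process for this token."""
    return embeddings[vocab[token]]

vector = embed("dog")   # a list of 4 floats
```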

Token Count and Context Window

Every LLM has a maximum token count – the number of text units it can process at once. This limit determines how much information the model can “see” in a single request.

Pros and Cons of More Tokens

More tokens mean:

  • Pros: Larger documents and longer tasks fit in a single request
  • Cons: Higher computational costs and slower responses

Context Window Limitations

The context window is like the model’s memory span. Imagine reading a novel and remembering only the last 20 pages:

  • Everything inside the window – remembered
  • Everything outside – forgotten
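The memory-span analogy can be sketched as a sliding window over tokens: only the most recent ones stay visible.

```python
def context_window(tokens, window_size):
    """Return only the most recent tokens the model can still 'see'."""
    return tokens[-window_size:]

pages = [f"page_{i}" for i in range(1, 101)]   # a 100-page novel
remembered = context_window(pages, 20)
# remembered covers page_81 .. page_100; pages 1-80 are "forgotten"
```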

When the model loses context, it may produce hallucinations or break logical flow.

That’s why token count is a trade-off: larger windows expand capability, but also require more resources.

Two Ways to Handle Long Context

To handle longer tasks, models turn to two strategies:

1. Retrieval-Augmented Generation (RAG)

Pulls in external data to extend memory – keeps answers grounded in facts.

Here’s how Retrieval-Augmented Generation works, step by step:

Diagram of Retrieval-Augmented Generation showing user input, embedding model, vector database, and LLM.

Source: Google Cloud Skills (YouTube channel)
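The pipeline in the diagram can be approximated with a toy retriever. Here bag-of-words cosine similarity stands in for the embedding model and vector database, and the document texts are made up for illustration.

```python
import math
import re
from collections import Counter

docs = {
    "doc1": "The Eiffel Tower is in Paris and is 330 metres tall.",
    "doc2": "Python is a programming language created in 1991.",
}

def bow(text):
    """Bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return num / norm if norm else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query and return the top k."""
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(docs[d])), reverse=True)
    return [docs[d] for d in ranked[:k]]

context = retrieve("How tall is the Eiffel Tower?")[0]
prompt = f"Context: {context}\nQuestion: How tall is the Eiffel Tower?"
# The retrieved context is prepended so the LLM answers from facts.
```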

2. Multi-step Reasoning

Breaks tasks into smaller steps – makes complex problems easier to handle.
Instead of trying to solve everything at once, the model works through the problem gradually, step by step, until it finds the solution.

The diagram below shows how a complex task is broken down into steps until reaching a final solution:

Flowchart of multi-step reasoning in a large language model, showing a complex task broken into steps leading to a solution.
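The step-by-step flow above can be sketched as a loop where each step's output feeds the next. `ask_model` here is a hypothetical stand-in for a real LLM call.

```python
def ask_model(prompt):
    # Hypothetical stand-in for a real LLM API call.
    return f"result of ({prompt})"

def multi_step(task, steps):
    """Work through a complex task gradually, one step at a time."""
    answer = task
    for step in steps:
        answer = ask_model(f"{step}: {answer}")
    return answer

final = multi_step("Summarise this report",
                   ["Extract key facts", "Draft an outline", "Write the summary"])
```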

Next Token Prediction – Core Function

At their core, LLMs work by predicting the next token.

How Prediction Works

  • Input: existing tokens
  • Output: the most probable next one

It’s like guessing how a friend will finish the sentence:

“I’m going to grab a cup of …” – coffee.

The following diagram illustrates how next token prediction works:

Diagram of next token prediction in a large language model, showing input tokens and probabilities for the next word.

Source: Google Cloud Skills (YouTube channel)
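The guessing game above can be sketched with a tiny bigram model: count which token most often follows each token in a toy corpus, then predict the most frequent follower.

```python
from collections import Counter, defaultdict

corpus = ("i am going to grab a cup of coffee . "
          "i am going to grab a cup of coffee . "
          "i am going to grab a cup of tea .")
words = corpus.split()

# Count which token follows which (a minimal next-token model).
following = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most probable next token after `token`."""
    return following[token].most_common(1)[0][0]

predicted = predict_next("of")   # 'coffee' – it follows 'of' most often
```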

Role of Transformers

This is possible thanks to the transformer architecture, which uses attention to connect tokens and keep outputs clear and consistent.

Randomness, Creativity, and Control

If models always picked the same word, text would sound repetitive.

Temperature

  • Low temperature – predictable, steady text
  • High temperature – creative but risky

The diagram below shows how temperature affects text generation in language models:

Temperature scale in a large language model showing low predictability versus high creativity in text generation.
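The effect shown in the diagram can be sketched with a temperature-scaled softmax: dividing the raw scores (logits) by the temperature before normalising sharpens or flattens the probability distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities; temperature controls spread."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2) # sharply favours the top token
hot = softmax_with_temperature(logits, 2.0)  # spreads probability around
```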

Top-p (Nucleus Sampling)

Narrows choices to the most probable tokens, balancing focus and variety.
Instead of letting the model choose from all possible words, Top-p limits the pool to the most likely ones.
This helps keep text natural while avoiding random or irrelevant words.
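A minimal sketch of the nucleus idea: keep the smallest set of top-ranked tokens whose cumulative probability reaches p, and discard the rest. The candidate words and probabilities are made up for illustration.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest top-ranked token set with cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

probs = {"coffee": 0.6, "tea": 0.25, "soup": 0.1, "gravel": 0.05}
kept = top_p_filter(probs, p=0.9)   # ['coffee', 'tea', 'soup'] – 'gravel' is cut
```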

RAG for Balance

RAG grounds answers with reliable sources, ensuring creativity doesn’t replace reliability.
In practice, this means the model can “look up” facts in a database or external documents.

The result: answers that are both creative and fact-based.

Focused Conversations and System Instructions

To guide AI chatbots and GPT bots in conversational AI, we use:

  • System Instructions – Define tone and style (e.g., teacher, journalist, assistant).
  • System Prompts – Narrow down context so answers stay consistent and relevant.

This gives AI a clear “personality” and keeps conversations on track.
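In practice, a system instruction is typically sent alongside the user's message in a role-tagged message list. The sketch below shows the general shape; exact field names vary by provider.

```python
# Hypothetical chat payload: most chat APIs accept a list of
# role-tagged messages roughly like this.
messages = [
    {"role": "system",
     "content": "You are a patient teacher. Explain concepts simply."},
    {"role": "user",
     "content": "What is a token?"},
]

# The system message sets the "personality"; the user message is the question.
system_instruction = messages[0]["content"]
```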

This way, large language models can adapt their style, stay consistent, and provide responses that feel natural and useful across different situations.
