How do transformers work in Gen AI technologies?

How do machines understand human language?

Machines use natural language processing (NLP) and neural networks trained on large datasets to break text into tokens and understand their meaning.

Instead of “reading” like humans, these systems look for patterns and probabilities in language, which allows them to translate, answer questions, or even generate natural-sounding text.

The image below shows how text representation evolved, leading to the use of transformers:

Evolution of text models ending with transformers in 2017.

Source: Google

What NLP techniques are used to understand human language?

To understand human communication, machines use a few key NLP techniques:

  • Embedding – turning words into vectors so that models can capture meaning and context.
  • Text classification – organizing text data into categories, often with supervised learning methods.
  • Next-word prediction – using deep learning to suggest the next words in a sentence.

Together, these techniques form the basis for tools we use every day, such as chatbots, translation systems, and sentiment analysis engines.

1. Embedding

Embedding turns words or phrases into dense vector representations that capture their meaning and context. 

For example, dog and cat appear close to each other in this vector space because they often share similar contexts, showing the model their semantic relationship:

Word embedding example used in transformers showing semantic similarity.
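The idea of "closeness" in a vector space can be measured with cosine similarity. Here is a minimal sketch using toy, hand-picked 3-dimensional vectors (real models use hundreds of dimensions learned from data):

```python
import math

# Toy 3-dimensional embeddings (illustrative values, not from a real model).
embeddings = {
    "dog": [0.8, 0.6, 0.1],
    "cat": [0.7, 0.7, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "dog" and "cat" point in similar directions, so their similarity is high,
# while "dog" and "car" do not.
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))
print(cosine_similarity(embeddings["dog"], embeddings["car"]))
```

In a trained embedding model, these vectors emerge automatically from the contexts in which words appear.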

2. Text classification

Text classification uses machine learning methods such as neural networks, logistic regression, or support vector machines to organize text data into predefined labels.

It is often used for:

  • Spam filtering
  • Sentiment detection
  • Topic categorization
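As a concrete illustration of spam filtering, here is a minimal naive Bayes classifier built from scratch on an invented four-message dataset (production systems would use a library and far more data):

```python
from collections import Counter
import math

# Tiny labeled dataset (invented examples for illustration).
train = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch with the team", "ham"),
]

# Count word frequencies per label (the "training" step of naive Bayes).
word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def classify(text):
    """Pick the label with the highest log-probability under a unigram model."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in word_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            # Laplace smoothing so unseen words don't zero out the probability.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("claim your free money"))  # → spam
print(classify("team meeting"))           # → ham
```

Neural network or SVM classifiers follow the same pattern: learn from labeled examples, then assign the most probable label to new text.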

3. Next-sentence and next-word prediction

Next-word prediction and next-sentence prediction are powered by transformers and other deep learning models that learn from large datasets to predict the next words in a sentence.

Some everyday applications are:

  • Search autocomplete
  • Predictive typing on phones
  • Chatbots with natural replies
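The core idea behind these applications can be sketched with a simple bigram model: count which word follows which, then predict the most frequent continuation. Transformers do this far more powerfully, but the prediction objective is the same. The corpus below is a toy example:

```python
from collections import Counter, defaultdict

# Tiny corpus; a real model trains on billions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug".split()

# Count bigrams: how often each word follows the previous one.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the word most frequently observed after `word`."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on" — the only word seen after "sat" here
```

Autocomplete and predictive typing extend this idea with much larger context and learned representations instead of raw counts.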

Vector representations of input text

For machines to process text, words and sentences must be turned into vector representations. These vectors encode tokens in a way that keeps their context and meaning. 

In effect, language becomes numbers that computers can work with directly.

Comparison of Token, Segment, and Position Embeddings

Concept             | What it is                                      | Why it matters
Token               | Smallest unit of text (word or subword)         | Base unit for text processing
Segment             | Group of tokens forming a sequence              | Helps separate sentences/paragraphs
Position embeddings | Add order information to tokens in transformers | Ensure correct meaning and sentence flow
Transformers: BERT input embeddings combining token, segment, and position vectors.

Source: Google
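BERT's input layer combines the three embedding types above by simply adding them element-wise. Here is a minimal sketch with tiny random vectors (4 dimensions instead of BERT's 768, and a made-up vocabulary):

```python
import random

random.seed(0)
DIM = 4  # embedding dimension (tiny for illustration; BERT uses 768)

def random_vector():
    return [random.uniform(-1, 1) for _ in range(DIM)]

# Three lookup tables, as in BERT's input layer.
token_emb = {tok: random_vector() for tok in ["[CLS]", "my", "dog", "[SEP]"]}
segment_emb = {0: random_vector(), 1: random_vector()}
position_emb = [random_vector() for _ in range(8)]

def input_embedding(tokens, segments):
    """BERT-style input: element-wise sum of token, segment, and position vectors."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[i])]
        for i, (tok, seg) in enumerate(zip(tokens, segments))
    ]

vectors = input_embedding(["[CLS]", "my", "dog", "[SEP]"], [0, 0, 0, 0])
print(len(vectors), len(vectors[0]))  # 4 tokens, each a 4-dimensional vector
```

The sum keeps all three kinds of information in a single vector per token, which is what the transformer layers then process.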

Context window and additional concepts in transformers

Transformers depend on a context window and advanced mechanisms such as multi-head attention, which allows models to focus on multiple parts of a sentence at the same time. 

Together, these elements create cutting-edge language understanding, enabling models to process large amounts of text quickly and effectively.

The diagram below shows the encoder–decoder structure of a transformer model:

Transformer model architecture with encoder, decoder, and attention.

Source: Google

Context window in NLP

The context window is the amount of text (in tokens) a model can process at once.

  • Small window – good for short tasks, but misses longer links.
  • Large window – handles long-range dependencies, but needs more resources.
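A fixed context window means the model effectively sees only the most recent tokens. This trade-off can be sketched in a couple of lines (the window size of 4 is arbitrary here):

```python
def truncate_to_window(tokens, window_size):
    """Keep only the most recent `window_size` tokens, as a model
    with a fixed context window effectively does."""
    return tokens[-window_size:]

conversation = "the quick brown fox jumps over the lazy dog".split()
print(truncate_to_window(conversation, 4))  # ['over', 'the', 'lazy', 'dog']
```

Anything outside the window is invisible to the model, which is why a small window misses long-range links while a large one costs more memory and compute.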

Other relevant aspects of transformers

Other important aspects of transformers include self-attention, regularization, and positional encodings, which together improve their performance and reliability.

Here is a simple visualization of how self-attention works using query, key, and value vectors:

Self-attention in transformers showing query, key, and value vectors.

Source: Google

Learned weights refer to the matrices:

  • W_Q (for the Query Q)
  • W_K (for the Key K)
  • W_V (for the Value V)

These are trainable parameters that the model learns during training.
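Putting the pieces together, scaled dot-product self-attention computes softmax(Q·Kᵀ/√d)·V. Here is a minimal pure-Python sketch with two tokens and identity weight matrices chosen purely for simplicity (real W_Q, W_K, W_V are learned):

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Turn a row of scores into attention weights that sum to 1."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention: softmax(Q Kᵀ / √d) V."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d = len(K[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K]
              for qr in Q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# Two tokens with 2-dimensional embeddings; identity weights for illustration.
X = [[1.0, 0.0], [0.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(X, I, I, I)
print(out)  # each output row is a weighted mix of the value vectors
```

Multi-head attention simply runs several such computations in parallel, each with its own learned W_Q, W_K, and W_V, and concatenates the results.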
