Following the Vector Paths: How LLMs Navigate from Question to Answer
When you ask a language model "why is the sky blue?", what actually happens inside? The answer lies in tracing vector paths through
high-dimensional embedding space.
Understanding Vector Embeddings
Vector embeddings transform discrete tokens into continuous numerical representations. Each word becomes a point in a high-dimensional space, typically 768 to 4,096 dimensions.
Token to Vector Transformation

Each token maps to a unique high-dimensional vector that encodes aspects of its meaning.
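As a concrete sketch, the lookup can be modeled as indexing into a learned embedding table. This toy example uses 4 dimensions and made-up values purely for readability; real models use hundreds to thousands of dimensions and learn the table during training.

```python
import numpy as np

# Toy vocabulary and embedding table: each token id maps to a row vector.
# Values are random stand-ins, not taken from any actual model.
vocab = {"why": 0, "is": 1, "the": 2, "sky": 3, "blue": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(tokens):
    """Look up each token's vector in the embedding table."""
    return embedding_table[[vocab[t] for t in tokens]]

vectors = embed(["why", "is", "the", "sky", "blue"])
print(vectors.shape)  # (5, 4): one 4-dimensional vector per token
```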
Semantic Vector Relationships
In embedding space, semantic similarity corresponds to geometric proximity. The model learns that “sky” clusters near “atmosphere”, with a cosine similarity of roughly 0.85.

Vector Space Clustering
Semantic similarity shown as geometric proximity
“Sky” Vector Neighborhood
- atmosphere (~0.85 similarity)
- clouds (~0.82 similarity)
- air (~0.79 similarity)
- weather (~0.76 similarity)
“Blue” Vector Neighborhood
The vector for “blue” activates clusters connecting color perception to physical properties of light.
- color (~0.88 similarity)
- wavelength (~0.71 similarity)
- light (~0.68 similarity)
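The similarity scores above are cosine similarities. A minimal sketch of the computation, using hypothetical 3-dimensional vectors rather than real learned embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors standing in for learned embeddings.
sky = np.array([0.9, 0.1, 0.3])
atmosphere = np.array([0.8, 0.2, 0.35])
engine = np.array([-0.5, 0.9, -0.2])

print(cosine_similarity(sky, atmosphere))  # high: related concepts
print(cosine_similarity(sky, engine))      # low: unrelated concepts
```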
The Attention Mechanism Journey
Transformer attention mechanisms route information through 32+ layers, where each layer refines understanding through query-key-value transformations.
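A single query-key-value transformation can be sketched as scaled dot-product attention. Dimensions here are toy values, and the learned weight matrices that produce Q, K, and V from token vectors are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 8))  # one 8-dim query per token
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = attention(Q, K, V)
print(out.shape)        # (5, 8): a contextualized vector per token
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```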

Layers 1-5: Syntax and Grammar
Initial layers detect structural patterns. The query vector identifies that “why” signals a causal explanation request.
Layers 6-15: Semantic Understanding
Middle layers activate domain knowledge. The combined [sky, blue] representation triggers optical phenomenon concepts, pulling in vectors for
light, atmosphere, and scattering.
Semantic Activation Pattern
Query: [sky, blue] → optical phenomenon
Key: Activates {light, color, atmosphere} clusters
Value: Retrieves scattering concepts
Result: Physics domain activation
Layers 16-25: Knowledge Retrieval
Deep layers access factual relationships stored in network weights. Rayleigh scattering emerges with associated wavelength mathematics.
Retrieved Physical Relationships
λ_blue = 450nm (wavelength of blue light)
scattering ∝ 1/λ⁴ (inverse fourth power law)
atmosphere ≈ N₂ + O₂ (primary molecular composition)
Rayleigh → Lord Rayleigh → elastic scattering
These relationships exist as learned weight patterns across billions of parameters.
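The inverse fourth-power law can be checked directly. Plugging in ~450 nm for blue (from the relationships above) and an assumed ~700 nm for red shows why blue dominates the scattered light:

```python
# Rayleigh scattering intensity scales as 1/lambda^4, so shorter
# wavelengths scatter far more strongly.
# Blue ~450 nm comes from the text; red ~700 nm is an assumed comparison value.
lam_blue = 450e-9  # meters
lam_red = 700e-9

ratio = (lam_red / lam_blue) ** 4
print(f"Blue light scatters about {ratio:.1f}x more than red")  # ~5.9x
```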
Layers 26-32: Reasoning and Assembly
Final layers construct causal chains. The model assembles the sequence:
sunlight → atmosphere → molecular scattering → wavelength selection → blue perception.
Multi-Head Attention Specialization
Transformer models employ multiple attention heads that specialize in different reasoning types. Each head learns distinct semantic relationships.
Specialized Processing Heads
- Head 1 – Subject Identification: Assigns high weight (0.89) to “sky”
- Head 2 – Property Attribution: Links subject to attribute (“sky” ↔ “blue”) with 0.92 weight
- Head 3 – Causal Reasoning: “why” activates explanation mode, pulls physics knowledge vectors
- Head 4 – Entity Relationships: Tracks sequential dependencies sun → light → atmosphere → eye
- Head 5 – Comparative Reasoning: Contrasts related concepts (blue vs red wavelengths, violet vs blue perception)
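Mechanically, head specialization begins with splitting the model width into per-head slices, each of which attends independently. A sketch of that split, with toy sizes not drawn from any real model:

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (tokens, d_model) into (num_heads, tokens, d_head)."""
    tokens, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(tokens, num_heads, d_head).transpose(1, 0, 2)

x = np.zeros((5, 32))  # 5 tokens, model width 32
heads = split_heads(x, num_heads=4)
print(heads.shape)  # (4, 5, 8): 4 heads, each seeing an 8-dim slice
```

Because each head attends over its own slice, the heads are free to learn different relationships, such as the subject-identification and causal-reasoning roles described above.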
Hidden State Evolution
As information flows through the network, each token’s vector representation evolves to incorporate contextual understanding.
Vector Evolution for “sky”
The representation shifts position in embedding space, moving closer to relevant physics concepts.
Geometric Operations in Vector Space
The model’s reasoning manifests as geometric operations. Semantic drift
describes how vectors move through space to approach related concepts.
Attention Weight Distribution
When processing “blue”, the attention mechanism distributes focus across semantically related tokens (illustrative weights):
"blue" attends to:
- "light" (0.87 weight)
- "wavelength" (0.79 weight)
- "scattering" (0.82 weight)
- "atmosphere" (0.76 weight)
Semantic Distance Reduction
Through layer processing, the “blue” vector transitions from color space toward physics space: its cosine similarity to “wavelength” rises sharply, for example from 0.42 to 0.81.
Output Generation: From Vectors to Text
Final layer representations project onto vocabulary space, producing probability distributions over 50,000+ possible next tokens.
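That projection can be sketched as a matrix multiply followed by a softmax. The model width here is a toy value; only the ~50,000-token vocabulary size comes from the text, and the weights are random stand-ins:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
vocab_size, d_model = 50_000, 16          # toy width; illustrative vocab size
hidden = rng.normal(size=d_model)         # final-layer vector for last token
W_out = rng.normal(size=(d_model, vocab_size)) / np.sqrt(d_model)

probs = softmax(hidden @ W_out)           # distribution over next tokens
print(probs.shape)  # (50000,)
print(probs.sum())  # sums to 1
```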
Next Token Prediction
After "The sky is blue because"...
P("light") = 0.23 ← High probability
P("of") = 0.18 ← Grammatical connector
P("Rayleigh") = 0.15 ← Technical term
P("sunlight") = 0.12 ← Alternative phasing
P("the") = 0.09 ← Generic article
...
The model samples from this distribution, typically selecting high-probability tokens while maintaining coherence.
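Sampling is commonly controlled with a temperature parameter: values below 1 sharpen the distribution toward high-probability tokens, values above 1 flatten it. A sketch using the illustrative probabilities listed above (renormalized, since the tail of the distribution is elided):

```python
import numpy as np

# Candidate continuations and illustrative probabilities from the text.
tokens = ["light", "of", "Rayleigh", "sunlight", "the"]
probs = np.array([0.23, 0.18, 0.15, 0.12, 0.09])
probs = probs / probs.sum()  # renormalize: the "..." tail is omitted

def sample(tokens, probs, temperature=1.0, rng=None):
    """Temperature sampling: T < 1 sharpens, T > 1 flattens."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.log(probs) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return tokens[rng.choice(len(tokens), p=p)]

print(sample(tokens, probs, temperature=0.7, rng=np.random.default_rng(0)))
```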
The Complete Vector Journey
From “why is the sky blue?” to a complete explanation, the process traverses billions of learned parameters.
The Path Summary

From question to answer: the full processing pipeline
- Tokenization: Text splits into processable units
- Embedding: Tokens become high-dimensional vectors
- Attention: Vectors exchange information through learned patterns
- Transformation: Representations evolve through 32+ layers
- Knowledge Retrieval: Physics concepts activate from weight patterns
- Reasoning: Causal chains assemble in late layers
- Generation: Vectors project to vocabulary, producing text
Following the Path Forward
Understanding these vector paths reveals how language models think. Rather than retrieving pre-written answers, they navigate learned geometric relationships in high-dimensional space.
Each query initiates a unique journey through this learned landscape, where semantic proximity guides reasoning and geometric operations produce understanding.