Giving Transformers a sense of order.
This visualization shows how adding positional encodings causes word embeddings to "rotate" in 2D space. Each word starts with a base embedding (shown in gray), and when we add its positional encoding, the resulting vector (shown in color) points in a slightly different direction. Words at different positions get different rotations, giving the model a geometric way to understand word order.
Watch how each word's vector rotates based on its position in the sentence.
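As a rough sketch of what "rotation" means here (the 2-D embedding, the helper name, and the wavelength of 20 are illustrative assumptions, not the page's actual code), adding a position-dependent sine/cosine vector to a fixed base embedding shifts the angle of the resulting vector:

```python
import numpy as np

# Hypothetical 2-D word embedding (the gray vector in the visualization).
base = np.array([1.0, 0.5])

def toy_encoding(pos, wavelength=20.0):
    """Toy 2-D positional encoding: one sine/cosine pair, arbitrary wavelength."""
    angle = 2 * np.pi * pos / wavelength
    return np.array([np.sin(angle), np.cos(angle)])

for pos in range(4):
    v = base + toy_encoding(pos)          # the colored, position-aware vector
    print(f"pos {pos}: direction = {np.degrees(np.arctan2(v[1], v[0])):.1f} deg")
```

Each position shifts the summed vector to a different angle, which is the rotation the animation shows.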
Transformer models use a fixed mathematical function to generate a unique vector for every position ($pos$) in a sequence. This vector is then added to the word's embedding. The encoding uses sine and cosine functions whose frequencies decrease geometrically across the embedding dimensions ($i$):
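$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Here $i$ indexes each sine/cosine pair and $d_{\text{model}}$ is the embedding size, following the original Transformer paper ("Attention Is All You Need").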
*The visualization below shows each of the **4 dimensions (2 pairs of sine/cosine waves)** on its own stacked line. We have manually set their periods for clear illustration: The first pair (D0, D1) completes 2 full periods, and the second pair (D2, D3) completes 1 full period over the position range (0-100).*
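For concreteness, a minimal NumPy sketch of those four illustrative waves (the periods of 50 and 100 positions follow from the description above; the page's actual implementation is an assumption):

```python
import numpy as np

# Toy 4-D encoding matching the illustration: pair 0 (D0, D1) has period 50
# (2 full cycles over positions 0-100), pair 1 (D2, D3) has period 100
# (1 full cycle). A real transformer would instead use the geometric
# 10000^(2i/d_model) frequency schedule from the formula above.
positions = np.arange(0, 101)
periods = [50.0, 100.0]

pe = np.zeros((len(positions), 4))
for pair, period in enumerate(periods):
    angle = 2 * np.pi * positions / period
    pe[:, 2 * pair] = np.sin(angle)        # D0, D2
    pe[:, 2 * pair + 1] = np.cos(angle)    # D1, D3

print(np.round(pe[:3], 3))                 # encodings for positions 0, 1, 2
```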
Click on a word below to see its positional encoding in the graph.
Enter a sentence above to see word positional embeddings.
Sinusoidal positional encodings use multiple frequencies (fast and slow oscillations) to create a unique "fingerprint" for each position. This is similar to how binary numbers work, but in continuous space!
Fast frequencies (low dims) = ones place. Slow frequencies (high dims) = hundreds place. Together they create unique position IDs.
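A tiny illustration of the place-value analogy (the example positions are arbitrary):

```python
# The ones digit (fast) changes at every step, the hundreds digit (slow)
# only every 100 steps; only the combination pins down the position.
for pos in (7, 8, 107, 108):
    print(f"pos {pos:3d}: ones = {pos % 10}, hundreds = {pos // 100}")
```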
Each position gets a unique pattern across all dimensions. No two positions have the same encoding.
High frequencies distinguish nearby positions. Low frequencies distinguish far-apart positions.
This demonstrates the **"binary" idea**: each dimension is a step function, creating a unique on/off sequence (a positional fingerprint) across all four dimensions.
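A hedged sketch of that binarized view, reusing the four toy dimensions from above (thresholding at zero is an assumption about how the on/off view is produced):

```python
import numpy as np

# Threshold each toy dimension at zero to get an on/off step function.
# The 4-bit pattern changes with position, acting as a coarse positional
# fingerprint; with only four dimensions it eventually repeats, which is
# why real models use many more.
positions = np.arange(0, 101, 10)
periods = np.array([50.0, 50.0, 100.0, 100.0])
phases = np.array([0.0, np.pi / 2, 0.0, np.pi / 2])   # cos(x) == sin(x + pi/2)

waves = np.sin(2 * np.pi * positions[:, None] / periods + phases)
bits = (waves >= 0).astype(int)
for pos, fingerprint in zip(positions, bits):
    print(f"pos {pos:3d}: {fingerprint}")
```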
Darker colors = more similar encodings. Each position has a unique pattern. Notice that positions far apart have very different encodings (lighter colors).
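One way such a similarity map can be computed (a sketch assuming standard sinusoidal encodings and cosine similarity; the page's exact metric and color scale may differ):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Standard sinusoidal positional encodings, shape (n_positions, d_model)."""
    pos = np.arange(n_positions)[:, None]            # (P, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, D/2)
    angles = pos / (10000 ** (2 * i / d_model))      # (P, D/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(100, 32)
pe_unit = pe / np.linalg.norm(pe, axis=1, keepdims=True)
similarity = pe_unit @ pe_unit.T                     # (100, 100) cosine similarities
print(np.round(similarity[0, [1, 5, 50]], 3))        # nearby positions score higher
```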
Each frequency acts like a "measuring stick" for word distances. High frequencies (fast oscillation) measure short distances well but can't distinguish far positions. Low frequencies (slow oscillation) measure long distances well but can't distinguish nearby positions.
Good for nearby positions (1-10 apart), useless for far positions (50+ apart)
Good for far positions (50+ apart), but nearby positions look too similar
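To make both "measuring sticks" concrete, here is a small sketch (the wavelengths of 10 and 200 are arbitrary illustrative choices):

```python
import numpy as np

def wave(pos, wavelength):
    """One sine/cosine pair at the given wavelength (a single measuring stick)."""
    angle = 2 * np.pi * pos / wavelength
    return np.array([np.sin(angle), np.cos(angle)])

for name, wavelength in [("fast (wavelength 10) ", 10.0),
                         ("slow (wavelength 200)", 200.0)]:
    near = np.linalg.norm(wave(0, wavelength) - wave(3, wavelength))
    far = np.linalg.norm(wave(0, wavelength) - wave(60, wavelength))
    print(f"{name}: |pos0 - pos3| = {near:.2f}, |pos0 - pos60| = {far:.2f}")

# The fast wave cleanly separates positions 0 and 3 but treats position 60
# exactly like position 0 (aliasing); the slow wave barely moves over 3
# steps yet clearly separates positions 0 and 60.
```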
💡 Key Insight: By using BOTH high and low frequencies (many dimensions), the model gets multiple "measuring sticks" that work at different scales. This is part of why transformer embeddings are high-dimensional (512 in the original Transformer): there are enough frequencies to distinguish word distances at every scale!