Giving Transformers a sense of order.
This visualization shows how adding positional encodings causes word embeddings to "rotate" in 2D space. Each word starts with a base embedding (shown in gray), and when we add its positional encoding, the resulting vector (shown in color) points in a slightly different direction. Words at different positions get different rotations, giving the model a geometric way to understand word order.
Watch how each word's vector rotates based on its position in the sentence.
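As a rough sketch of what "rotation" means here (the 2-D embedding, the helper name, and the wavelength of 20 are illustrative assumptions, not the page's actual code), adding a position-dependent sine/cosine vector to a fixed base embedding shifts the angle of the resulting vector:

```python
import numpy as np

# Hypothetical 2-D word embedding (the gray vector in the visualization).
base = np.array([1.0, 0.5])

def toy_encoding(pos, wavelength=20.0):
    """Toy 2-D positional encoding: one sine/cosine pair, arbitrary wavelength."""
    angle = 2 * np.pi * pos / wavelength
    return np.array([np.sin(angle), np.cos(angle)])

for pos in range(4):
    v = base + toy_encoding(pos)          # the colored, position-aware vector
    print(f"pos {pos}: direction = {np.degrees(np.arctan2(v[1], v[0])):.1f} deg")
```

Each position shifts the summed vector to a different angle, which is the rotation the animation shows.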
Transformer models use a fixed mathematical function to generate a unique vector for every position ($pos$) in a sequence. This vector is then added to the word's embedding. The encoding uses sine and cosine functions whose frequencies decrease geometrically across the embedding dimensions ($i$):
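$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Here $i$ indexes each sine/cosine pair and $d_{\text{model}}$ is the embedding size, following the original Transformer paper ("Attention Is All You Need").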
*The visualization below shows each of the **4 dimensions (2 pairs of sine/cosine waves)** on its own stacked line. We have manually set their periods for clear illustration: The first pair (D0, D1) completes 2 full periods, and the second pair (D2, D3) completes 1 full period over the position range (0-100).*
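For concreteness, a minimal NumPy sketch of those four illustrative waves (the periods of 50 and 100 positions follow from the description above; the page's actual implementation is an assumption):

```python
import numpy as np

# Toy 4-D encoding matching the illustration: pair 0 (D0, D1) has period 50
# (2 full cycles over positions 0-100), pair 1 (D2, D3) has period 100
# (1 full cycle). A real transformer would instead use the geometric
# 10000^(2i/d_model) frequency schedule from the formula above.
positions = np.arange(0, 101)
periods = [50.0, 100.0]

pe = np.zeros((len(positions), 4))
for pair, period in enumerate(periods):
    angle = 2 * np.pi * positions / period
    pe[:, 2 * pair] = np.sin(angle)        # D0, D2
    pe[:, 2 * pair + 1] = np.cos(angle)    # D1, D3

print(np.round(pe[:3], 3))                 # encodings for positions 0, 1, 2
```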
Click on a word below to see its positional encoding in the graph.
Enter a sentence above to see word positional embeddings.
Sinusoidal positional encodings use multiple frequencies (fast and slow oscillations) to create a unique "fingerprint" for each position. This is similar to how binary numbers work, but in continuous space!
Fast frequencies (low dims) = ones place. Slow frequencies (high dims) = hundreds place. Together they create unique position IDs.
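A tiny illustration of the place-value analogy (the example positions are arbitrary):

```python
# The ones digit (fast) changes at every step, the hundreds digit (slow)
# only every 100 steps; only the combination pins down the position.
for pos in (7, 8, 107, 108):
    print(f"pos {pos:3d}: ones = {pos % 10}, hundreds = {pos // 100}")
```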
Each position gets a unique pattern across all dimensions. No two positions have the same encoding.
High frequencies distinguish nearby positions. Low frequencies distinguish far-apart positions.
This demonstrates the **"binary" idea**: each dimension is a step function, creating a unique on/off sequence (a positional fingerprint) across all four dimensions.
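A hedged sketch of that binarized view, reusing the four toy dimensions from above (thresholding at zero is an assumption about how the on/off view is produced):

```python
import numpy as np

# Threshold each toy dimension at zero to get an on/off step function.
# The 4-bit pattern changes with position, acting as a coarse positional
# fingerprint; with only four dimensions it eventually repeats, which is
# why real models use many more.
positions = np.arange(0, 101, 10)
periods = np.array([50.0, 50.0, 100.0, 100.0])
phases = np.array([0.0, np.pi / 2, 0.0, np.pi / 2])   # cos(x) == sin(x + pi/2)

waves = np.sin(2 * np.pi * positions[:, None] / periods + phases)
bits = (waves >= 0).astype(int)
for pos, fingerprint in zip(positions, bits):
    print(f"pos {pos:3d}: {fingerprint}")
```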
Darker colors = more similar encodings. Each position has a unique pattern. Notice that positions far apart have very different encodings (lighter colors).
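One way such a similarity map can be computed (a sketch assuming standard sinusoidal encodings and cosine similarity; the page's exact metric and color scale may differ):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Standard sinusoidal positional encodings, shape (n_positions, d_model)."""
    pos = np.arange(n_positions)[:, None]            # (P, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, D/2)
    angles = pos / (10000 ** (2 * i / d_model))      # (P, D/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(100, 32)
pe_unit = pe / np.linalg.norm(pe, axis=1, keepdims=True)
similarity = pe_unit @ pe_unit.T                     # (100, 100) cosine similarities
print(np.round(similarity[0, [1, 5, 50]], 3))        # nearby positions score higher
```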
Each frequency acts like a "measuring stick" for word distances. High frequencies (fast oscillation) measure short distances well but can't distinguish far positions. Low frequencies (slow oscillation) measure long distances well but can't distinguish nearby positions.
Good for nearby positions (1-10 apart), useless for far positions (50+ apart)
Good for far positions (50+ apart), but nearby positions look too similar
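To make both "measuring sticks" concrete, here is a small sketch (the wavelengths of 10 and 200 are arbitrary illustrative choices):

```python
import numpy as np

def wave(pos, wavelength):
    """One sine/cosine pair at the given wavelength (a single measuring stick)."""
    angle = 2 * np.pi * pos / wavelength
    return np.array([np.sin(angle), np.cos(angle)])

for name, wavelength in [("fast (wavelength 10) ", 10.0),
                         ("slow (wavelength 200)", 200.0)]:
    near = np.linalg.norm(wave(0, wavelength) - wave(3, wavelength))
    far = np.linalg.norm(wave(0, wavelength) - wave(60, wavelength))
    print(f"{name}: |pos0 - pos3| = {near:.2f}, |pos0 - pos60| = {far:.2f}")

# The fast wave cleanly separates positions 0 and 3 but treats position 60
# exactly like position 0 (aliasing); the slow wave barely moves over 3
# steps yet clearly separates positions 0 and 60.
```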
💡 Key Insight: By using BOTH high and low frequencies (many dimensions), the model gets multiple "measuring sticks" that work at different scales. This is part of why transformer embeddings are high-dimensional (512 in the original Transformer): there are enough frequencies to distinguish word distances at every scale!