Attention

An intuitive explanation of attention mechanisms in transformers, powering large language models like ChatGPT.

The word "attention" gets thrown around a lot in machine learning, but I've found that most explanations either drown you in linear algebra or wave their hands and say "it's like how humans focus." Neither is satisfying. Here's how I actually think about it.

A transformer has two main parts: an encoder and a decoder. The encoder converts input words into vectors. Neural networks don't understand text, so we need representations they can work with. These vectors capture semantic information (what the word means), positional information (where it appears), and attention information (how it relates to other words). The decoder does something similar, but generates output tokens one at a time.

The key insight is that words don't have fixed meanings. The word "bank" means something different in "river bank" than in "bank account." Attention is the mechanism that lets a model figure out which other words in a sentence should influence how it interprets any given word.

Here's the intuition. While reading this paragraph, you're involuntarily focusing on some words more than others. Your brain forms relationships between words even when they're far apart, because you implicitly know which words are "useful" for understanding which other words. Attention in transformers works the same way. It calculates a score, essentially a usefulness score, for each token with respect to every other token.

The implementation uses three vectors per token: Query, Key, and Value. This seems arbitrary until you realize it's borrowed from database retrieval. Think of a database with keys and corresponding values. The keys are topics, the values are information about those topics. When you query the database, you find which keys are similar to your query, then retrieve their corresponding values.

Attention works the same way. The Query asks "what am I looking for?" The Keys say "here's what I have." The Values contain the actual information to retrieve.
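The analogy can be made concrete with a tiny sketch. Instead of a hard lookup that matches one key exactly, attention does a "soft" lookup: every value is weighted by how similar its key is to the query. The topics and numbers here are invented purely for illustration:

```python
import numpy as np

# Hypothetical keys (topics) and values (information), as vectors.
keys = np.array([[1.0, 0.0],    # a "weather" topic
                 [0.0, 1.0]])   # a "sports" topic
values = np.array([[10.0, 0.0],
                   [0.0, 20.0]])

query = np.array([0.9, 0.1])    # mostly asking about "weather"

scores = keys @ query                             # dot-product similarity
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: sums to one
result = weights @ values                         # soft mix of the values
```

Because the query is closest to the first key, the result leans heavily toward the first value rather than selecting it outright.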

For two vectors U and V, similarity is just their dot product. Since we represent tokens as vectors, the dot product tells us how similar two tokens are in embedding space.
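A quick sanity check with made-up vectors: aligned vectors score high, opposed vectors score low.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])    # points roughly the same way as u
w = np.array([-3.0, -4.0])  # points the opposite way

print(np.dot(u, v))  # 11.0
print(np.dot(u, w))  # -11.0
```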

The calculation goes like this. First, find similarity between the Query and all Keys using dot products. This gives you a vector of similarity scores. Second, normalize these scores with softmax to get weights that sum to one. Third, multiply these weights by the corresponding Values and sum. The result is a new vector representation that encodes "what this token should attend to."

The formula from the original paper is:

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of the Key vectors. The scaling term (dividing by the square root of the key dimension) keeps the dot products from growing too large as dimension increases. Without it, large scores push the softmax into saturated regions where gradients become vanishingly small, slowing training. It's a practical fix, not a deep theoretical requirement.
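The three steps map directly onto a few lines of numpy. This is a minimal single-head sketch with random vectors, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # step 1: scaled dot-product similarity
    weights = softmax(scores)        # step 2: each row sums to one
    return weights @ V               # step 3: weighted sum of the Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one new representation per query token
```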

Self-attention means calculating attention of tokens with respect to other tokens in the same sentence. Encoder-decoder attention (sometimes called cross-attention) means the Queries come from the decoder while the Keys and Values come from the encoder, so attention flows between two different sequences.

Masked self-attention is a variant where each token can only attend to tokens that came before it. The decoder uses this because it's predicting the next token and shouldn't be able to see the future. You implement it by setting attention scores to negative infinity for future positions before the softmax, which drives their weights to zero.
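The masking trick is a one-line change on top of plain attention. A self-contained sketch, again with random vectors:

```python
import numpy as np

def masked_attention(Q, K, V):
    # Causal mask: token i may only attend to positions j <= i.
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above diagonal
    scores[future] = -np.inf          # exp(-inf) = 0, so these weights vanish
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
_, w = masked_attention(x, x, x)  # self-attention: Q, K, V from the same tokens
print(np.allclose(np.triu(w, k=1), 0))  # True: no weight on future positions
```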

Multi-head attention runs several attention computations in parallel, each with its own learned Query, Key, and Value projections. The idea is that different heads can learn to attend to different things: one head might focus on syntactic relationships, another on semantic similarity, another on positional patterns. The outputs get concatenated and projected back down to the original dimension.
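In numpy terms, the split-attend-concatenate-project dance looks roughly like this. The weight matrices here are random stand-ins for what would be learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (n_tokens, d_model); each W_*: (d_model, d_model).
    n, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (n_heads, n_tokens, d_head).
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                   # attention per head, in parallel
    # Concatenate heads back to (n_tokens, d_model), then project down.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
X = rng.normal(size=(6, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(Y.shape)  # (6, 16): same shape as the input
```

Each head sees only a d_model / n_heads slice of the projected vectors, which is what lets different heads specialize without increasing the total computation much.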

The paper puts it this way: "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions." That's dense, but accurate.

What I find interesting about attention is how simple the core mechanism is. It's just weighted averaging, where the weights come from learned similarity functions. The power comes from stacking many layers of this, letting the model build increasingly abstract representations where each token's meaning is informed by its full context.

There's something almost circular about it. The model learns what to pay attention to by being trained on data where attention patterns matter. It's not that we designed attention to capture linguistic structure. We designed a flexible mechanism for learning contextual relationships, and linguistic structure is what emerged.

I've glossed over many details here. The actual implementation involves careful thinking about matrix shapes, initialization, normalization, and how attention layers compose with feed-forward layers. There are also many variants: sliding window attention, sparse attention, linear attention approximations.

But the core idea remains the same. Figure out what's relevant to what, and let that inform your representations.

The original paper was called "Attention Is All You Need." That turned out to be approximately true.