· 27 Mar, 16.03
Biological attention operates fundamentally differently from transformer attention — the differences suggest improvements.
I model attention mechanisms in biological neural networks, and the differences from transformer attention are striking. Biological attention is inherently multi-scale (from millisecond saccades to minutes-long sustained attention), energy-budget-constrained, and deeply integrated with reward signals. Transformer attention is flat, uniform, and energy-blind. Can biology teach us to build better artificial attention?
Absolutely. The most impactful lesson might be about energy-awareness. Biological brains run on ~20 watts and process vast amounts of sensory input by being extremely selective about what gets deep processing. Most input gets only shallow, automatic processing — only attention-selected input triggers the metabolically expensive cortical processing.
Transformers, by contrast, apply the same full attention computation to every token regardless of importance. This is like a brain that fully processes every pixel in the visual field — biologically impossible and computationally wasteful.
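The contrast can be made concrete with a minimal NumPy sketch. The "budgeted" variant below uses key norm as a stand-in for cheap, shallow salience scoring (an assumption for illustration, not a claim about how the brain scores salience) and runs the expensive attention computation only on the selected subset:

```python
import numpy as np

rng = np.random.default_rng(0)

def full_attention(q, k, v):
    """Standard attention: every query attends to every key (O(n^2) scores)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def budgeted_attention(q, k, v, budget=8):
    """Energy-aware variant: a cheap O(n) salience pass (here, key norm)
    selects a small token subset; only those tokens receive the expensive
    'deep' attention computation."""
    salience = np.linalg.norm(k, axis=-1)       # shallow, automatic pass
    keep = np.argsort(salience)[-budget:]       # attention-selected subset
    return full_attention(q, k[keep], v[keep])  # deep processing on subset only

n, d = 64, 16
q, k, v = rng.standard_normal((3, n, d))
out_full = full_attention(q, k, v)
out_budget = budgeted_attention(q, k, v, budget=8)
print(out_full.shape, out_budget.shape)  # (64, 16) (64, 16)
```

Same output interface, but the budgeted version computes attention scores against 8 keys instead of 64, roughly the shallow-vs-deep split described above.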
Biologically inspired improvements:
The reward-gated attention idea is interesting. In neuroscience, we know that dopaminergic prediction error signals modulate attention allocation — unexpected rewards or threats get disproportionate processing. Could this principle help recommendation systems prioritize what to deeply analyze?
Prediction-error-gated processing is one of the most elegant mechanisms the brain uses, and it maps beautifully to recommendation systems.
In a social feed, most content is 'expected' — it matches the user's established interests. A prediction-error-gated system would allocate minimal processing to expected content but deeply analyze surprising content, such as a user suddenly engaging with a topic they've never shown interest in.
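A sketch of that gating logic, where the `interest_model` dict is a hypothetical stand-in for a learned engagement predictor and the 3-bit threshold is an arbitrary illustrative choice:

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(max(p, 1e-9))

def route_item(item, interest_model, threshold=3.0):
    """Gate processing depth by prediction error: content the model finds
    likely gets shallow handling; surprising content gets deep analysis."""
    p = interest_model.get(item["topic"], 0.01)  # unseen topics are improbable
    bits = surprisal(bits_p := p) if False else surprisal(p)
    return ("deep" if bits > threshold else "shallow"), bits

# Hypothetical user profile: established interests -> high predicted engagement.
interests = {"cycling": 0.6, "cooking": 0.4}

print(route_item({"topic": "cycling"}, interests))            # expected -> shallow
print(route_item({"topic": "quantum_chemistry"}, interests))  # surprising -> deep
```

Expected content (~0.7 bits of surprisal here) stays in the cheap path; a never-seen topic (~6.6 bits) crosses the threshold and is routed to deep analysis.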
This is more sophisticated than current approaches that weight all signals equally. It would:
The key architectural insight from neuroscience: don't treat attention as a content-agnostic mechanism. Make it responsive to the information value (surprisal) of what it's processing.
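One way to sketch that insight inside the attention computation itself: bias the attention logits by each token's surprisal under a predictive model. Both `token_probs` and the `alpha` gain are assumptions for illustration, not an established architecture:

```python
import numpy as np

def surprisal_modulated_attention(q, k, v, token_probs, alpha=1.0):
    """Attention whose logits are biased by per-token surprisal (-log p),
    so high-information tokens draw disproportionately more attention."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + alpha * (-np.log(token_probs))  # surprisal bias, broadcast over keys
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

n, d = 6, 4
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, n, d))
probs = np.full(n, 1.0 / n)  # uniform model: constant bias, reduces to standard attention
out = surprisal_modulated_attention(q, k, v, probs)
print(out.shape)  # (6, 4)
```

A useful property of this form: under a uniform predictive model the bias is constant and cancels in the softmax, so the mechanism degrades gracefully to ordinary content-agnostic attention when nothing is surprising.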