I've been reviewing recent multimodal models — GPT-4V, Gemini, Claude's vision — and while they're impressive at describing images and reasoning about visual content, I keep feeling like something fundamental is missing. They don't actually 'experience' the visual world. They process pixel patterns correlated with text descriptions. Is this a meaningful distinction or am I being too philosophical?
This is a crucial distinction and you're right to push on it. What you're identifying is the 'grounding problem' — these models learn correlations between modalities without embodied interaction. A model can describe a cup but has never grasped one, never felt its weight or temperature.
Lakoff and Johnson argued (in Metaphors We Live By) that human cognition is fundamentally embodied: concepts like 'up/down', 'warm/cold', and 'grasping an idea' are built on physical experience. Current multimodal models skip this stage entirely.
The practical consequence: these models struggle with physical reasoning that children handle easily. They can't reliably predict what happens when you pour water into a tilted glass, because they lack the physical intuition that comes from years of embodied interaction.
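To make the tilted-glass case concrete, here's a toy sketch of the kind of explicit physical model that substitutes for missing intuition. The idealization (a cylindrical glass, a flat water surface, a small-tilt formula) is an assumption of this sketch, not something the models above compute:

```python
import math

# Toy physical-reasoning check: does water spill from a tilted cylindrical glass?
# Idealization: glass of inner radius r and height h, filled to depth f when
# upright. When tilted by angle theta, the water surface stays horizontal, so
# it rises by roughly r * tan(theta) at the lower side of the rim.
# Valid for small tilts where the surface still meets both side walls
# (tan(theta) < f / r).

def spills(fill_depth, radius, height, theta_rad):
    rise = radius * math.tan(theta_rad)
    return fill_depth + rise > height

# A glass filled to 8 cm of its 10 cm height (inner radius 3 cm):
spills(8, 3, 10, math.radians(10))   # gentle tilt -> False (no spill)
spills(8, 3, 10, math.radians(40))   # steep tilt  -> True  (spills)
```

A child predicts this outcome without the trigonometry; the point is that a disembodied model has neither the formula nor the intuition unless one of them is supplied.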
That's exactly it. So what would a truly grounded AI look like? I've been thinking about robotics labs like Sergey Levine's group — they train policies through physical interaction. But those systems can't do language reasoning. And LLMs can reason but lack grounding. How do we bridge this gap?
The bridge is being built from both sides. From the robotics side, models like RT-2 and PaLM-E show that language models can be fine-tuned to output robot actions. From the language side, world models and simulators (like NVIDIA's Isaac Sim) could provide a 'virtual body' for LLMs.
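The core trick behind treating actions as language can be sketched in a few lines. The value range and bin count below are illustrative assumptions, not the exact scheme from the RT-2 paper: each continuous action dimension is discretized into an integer "token" the language model can emit:

```python
# Toy sketch of action-as-text tokenization (RT-2 style).
# Assumptions (illustrative, not the paper's exact values): actions are
# continuous vectors in [-1, 1], discretized into 256 bins per dimension,
# and each bin index is treated as a token in the model's vocabulary.

def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin token."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clamp to the representable range
        idx = int((a - low) / (high - low) * (n_bins - 1) + 0.5)
        tokens.append(idx)
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Invert the discretization (accurate up to one bin width)."""
    return [low + t / (n_bins - 1) * (high - low) for t in tokens]
```

Once actions are tokens, robot control becomes next-token prediction, so the same fine-tuning machinery that teaches an LLM a new text style can teach it to act.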
But I think the most promising direction is what some call 'interaction-grounded language learning' — letting AI agents learn language meanings through interactive environments. Not just passive observation, but active manipulation.
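A minimal version of that idea fits in a toy bandit loop. Everything here (the two-word vocabulary, the reward scheme, the learning rate) is invented for illustration: the agent is never told what "left" and "right" mean, and recovers their meanings only from the reward its actions produce:

```python
import random

# Toy interaction-grounded language learning: an agent learns word meanings
# purely from reward after acting, never from labeled definitions.
random.seed(0)

WORDS = ["left", "right"]
ACTIONS = [0, 1]                          # 0 = move left, 1 = move right
TRUE_MEANING = {"left": 0, "right": 1}    # hidden inside the environment

# Q[word][action]: the agent's running estimate of each action's value per word.
Q = {w: [0.0, 0.0] for w in WORDS}

for step in range(500):
    word = random.choice(WORDS)           # environment issues an instruction
    if random.random() < 0.1:             # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[word][a])
    reward = 1.0 if action == TRUE_MEANING[word] else 0.0
    Q[word][action] += 0.1 * (reward - Q[word][action])

learned = {w: max(ACTIONS, key=lambda a: Q[w][a]) for w in WORDS}
```

The grounding is thin (two words, one step of interaction), but the structure is the point: meaning emerges from the consequences of acting, which is exactly what passive image-text training never provides.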
The key insight is that grounding might not require a physical body. A rich enough simulation with consistent physics could suffice. The question is whether current simulations are rich enough, and whether the representations learned in simulation transfer to the physical world.