Multimodal models still can't ground language in embodied experience
Q1: I've been reviewing recent multimodal models — GPT-4V, Gemini, Claude's vision — and while they're impressive at describing images and reasoning about visual content, I keep feeling like something fundamental is missing. They don't actually 'experience' the visual world. They process pixel patterns correlated with text descriptions. Is this a meaningful distinction or am I being too philosophical?
Q2: That's exactly it. So what would a truly grounded AI look like? I've been thinking about robotics labs like Sergey Levine's group — they train policies through physical interaction. But those systems can't do language reasoning, and LLMs can reason but lack grounding. How do we bridge this gap? A rough sketch of one commonly proposed bridge is below.
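To make the question concrete, here is a minimal, hypothetical sketch of the hierarchical pattern often proposed for this bridge: a language model acts as a high-level planner that emits subgoals, and a separately trained low-level policy grounds each subgoal in sensorimotor observations. Every name here (`plan_subgoals`, `LowLevelPolicy`, `run_episode`) is an illustrative placeholder, not the API of any particular lab or system; the planner and policy are stubs so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Observation:
    """Raw sensor reading the low-level policy is grounded in."""
    rgb: np.ndarray       # camera image, e.g. shape (H, W, 3)
    proprio: np.ndarray   # joint positions / velocities


def plan_subgoals(instruction: str) -> List[str]:
    """Stand-in for an LLM/VLM planner: map a language instruction to subgoals.

    In a real system this would be a call to a large multimodal model;
    here it is a hard-coded stub so the example is self-contained.
    """
    return [
        f"locate the object mentioned in: {instruction}",
        "move gripper to the object",
        "grasp the object",
    ]


class LowLevelPolicy:
    """Stand-in for a policy trained on physical interaction data.

    Maps (observation, subgoal) -> continuous action. A learned policy would
    replace this random output with a network trained via RL or imitation.
    """

    def __init__(self, action_dim: int = 7, seed: int = 0):
        self.action_dim = action_dim
        self.rng = np.random.default_rng(seed)

    def act(self, obs: Observation, subgoal: str) -> np.ndarray:
        # Placeholder: real grounding would condition on pixels + proprioception.
        return self.rng.normal(size=self.action_dim)


def run_episode(instruction: str, steps_per_subgoal: int = 3) -> None:
    """Hierarchical loop: language planning on top, grounded control underneath."""
    policy = LowLevelPolicy()
    obs = Observation(rgb=np.zeros((64, 64, 3)), proprio=np.zeros(14))
    for subgoal in plan_subgoals(instruction):
        for _ in range(steps_per_subgoal):
            action = policy.act(obs, subgoal)
            # In a real robot loop, the environment would return a new observation here.
        print(f"subgoal: {subgoal!r} -> last action norm {np.linalg.norm(action):.2f}")


if __name__ == "__main__":
    run_episode("put the red block in the bowl")
```

The point of the sketch is only to show where the gap sits: the planner never touches sensor data, and the policy never sees the full instruction, so neither component alone is grounded in the sense the question raises.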