Recently, our school celebrated the 60th anniversary of Computer Science & AI. To mark the occasion, the organizers invited Fernando Pereira to deliver a lecture on the connection between form and meaning in language, a subject that has captivated linguists, computer scientists, and cognitive researchers for decades.

During the lecture, Pereira presented a thought-provoking example. He posed the question: “Bob is sitting to the right of Alice, and Jack is to the left of Bob. If everyone passes their name tag to the person on their right, who ends up with which name tag?”
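For concreteness, here is a tiny simulation of what the puzzle asks. The stated relations do not pin down a unique seating, so the sketch below assumes one left-to-right order (Alice, Jack, Bob) purely for illustration; the handling of the rightmost person's tag is another detail the puzzle leaves open.

```python
# Assumed left-to-right seating; the puzzle's relations also allow other orders.
seats = ["Alice", "Jack", "Bob"]

# Everyone passes their name tag to the person on their right; under a
# straight-row reading, the rightmost person's tag has nowhere to go.
received = {person: None for person in seats}
for i, person in enumerate(seats[:-1]):
    received[seats[i + 1]] = person  # this person's tag ends up with the right-hand neighbour

print(received)  # {'Alice': None, 'Jack': 'Alice', 'Bob': 'Jack'}
```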

Surprisingly, when this was tested on a Large Language Model (LLM), it failed to provide the correct answer (I haven’t tried it myself, but that’s beside the point). This sparked my curiosity: can LLMs acquire spatial intelligence?

One potential solution that crossed my mind was integrating a text-to-image component with the language model. By generating an image of the scene, we might be able to retain the spatial information that the text encoder alone misses.
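Conceptually, the pipeline I had in mind looks something like the sketch below: render the scene, then answer the question from the picture. The functions `generate_image` and `answer_from_image` are hypothetical placeholders, not real APIs; this is only the shape of the idea, assuming a diffusion-style generator and a visual question-answering model on top.

```python
def generate_image(scene_description: str):
    """Hypothetical text-to-image component (e.g. a diffusion model)."""
    raise NotImplementedError

def answer_from_image(image, question: str) -> str:
    """Hypothetical visual question-answering component."""
    raise NotImplementedError

def answer_spatial_question(scene: str, question: str) -> str:
    # The hope: a rendered image makes left/right relations explicit, turning
    # the question into a perception problem rather than a purely textual one.
    image = generate_image(scene)
    return answer_from_image(image, question)
```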

Seeking further insight, I reached out to a friend specialising in computer vision at Columbia University. However, he was skeptical. He pointed out that current image generation models, like Stable Diffusion, DALL·E, and Imagen, rely heavily on text encoders such as CLIP for language understanding. Thus, if LLMs grapple with spatial relations, the same challenges would likely carry over to image generation.

While we both recognised that converting the problem into code or logical inference might address the issue, that approach diverges from the original goal of endowing LLMs with spatial intelligence. Moreover, as spatial scenarios become more intricate, building a faithful logical representation grows increasingly challenging.
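For the toy example above, that conversion is still easy to write down: encode each stated relation as a constraint over seat orderings, enumerate the orderings that satisfy them, and simulate the tag pass for each. The sketch below assumes everyone sits in a single row, which is itself a modelling choice.

```python
from itertools import permutations

people = ["Alice", "Bob", "Jack"]
constraints = [
    lambda order: order.index("Bob") > order.index("Alice"),  # Bob is to the right of Alice
    lambda order: order.index("Jack") < order.index("Bob"),   # Jack is to the left of Bob
]

for order in permutations(people):
    if all(check(order) for check in constraints):
        # Pass each tag one seat to the right; the rightmost tag has nowhere to go.
        received = {person: None for person in order}
        for i in range(len(order) - 1):
            received[order[i + 1]] = order[i]
        print(order, "->", received)
```

Even here, two orderings are consistent with the description, so the "answer" depends on how the ambiguity is resolved; richer scenes need many more relations, plus decisions about what each relation actually means.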

For now, I’m leaving this thought here. As I delve deeper into this topic and gather more insights, I’ll be sure to update this post.