The interaction between humans and machines has made enormous progress in recent years. Natural language processing (NLP) powered by artificial intelligence (AI) in particular has developed rapidly. One area that remains challenging, however, is the so-called "grounding" of language, i.e., linking words and sentences to real or virtual environments. This becomes particularly complex when multiple perspectives have to be considered, as in communication between several agents sharing an environment.
A research team has addressed this challenge and presented a new task and associated dataset for the generation and understanding of referring expressions in multi-agent environments. In this task, two agents sharing a scene must consider each other's visual perspective in order to produce and understand references to objects in the scene and their spatial relationships to each other.
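To make the setup concrete, one task instance can be pictured roughly as the data structure below. This is only an illustrative sketch: the class and field names are assumptions for readability, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of one task instance; names are hypothetical,
# not the dataset's actual schema.

@dataclass
class AgentView:
    position: tuple[float, float, float]  # where the agent stands in the scene
    orientation: float                    # viewing direction in radians
    visible_objects: list[str]            # object ids this agent can currently see

@dataclass
class ReferenceTask:
    scene_id: str
    speaker: AgentView        # perspective of the agent producing the expression
    listener: AgentView       # perspective of the agent resolving it
    target_object: str        # object the speaker wants the listener to identify
    referring_expression: str = ""  # e.g. "the mug to the left of you, behind the lamp"

# The speaker must phrase the expression so that it resolves correctly
# from the listener's viewpoint, not just its own.
```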
Imagine trying to describe to a friend over the phone where to find a particular object in a room. You describe the position of the object from your point of view, but your friend sees the room from a different perspective. In this case, you need to be able to take your friend's perspective in order to provide them with an understandable description. The same is true for AI agents interacting in multi-agent environments.
To address this challenge, the researchers created a dataset of 2,970 human-written referring expressions, each paired with human comprehension ratings. The dataset was then used to evaluate how well automated models perform as speakers and listeners when interacting with human partners.
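The evaluation boils down to measuring communicative success: how often the listener identifies the object the speaker intended. A minimal sketch of such a metric, with `speaker.describe` and `listener.resolve` as hypothetical stand-ins for either a model or a human participant, might look like this:

```python
def communicative_success(tasks, speaker, listener):
    """Fraction of trials in which the listener picks the intended object.

    `speaker` and `listener` are placeholders for a model or a human;
    this mirrors the paper's evaluation only at a high level.
    """
    correct = 0
    for task in tasks:
        # The speaker describes the target from its own viewpoint,
        # ideally phrased so it resolves from the listener's viewpoint.
        expression = speaker.describe(task.target_object, view=task.speaker)
        guess = listener.resolve(expression, view=task.listener)
        correct += int(guess == task.target_object)
    return correct / len(tasks)
```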
The results show that the models' performance in both generating and understanding references lags behind that of human agent pairs. This suggests that there is still much room for improvement in developing AI systems that can understand and use language in complex, multi-perspective environments.
One promising approach explored in the study is training an open-weight speaker model on signals of communicative success obtained from a paired listener model. This raised communicative success from 58.9% to 69.3%, even outperforming the strongest proprietary model.
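One way to picture this idea is the sketch below, which assumes a simple reinforcement-style update: the speaker is rewarded whenever a frozen listener resolves its expression to the intended object. This is an assumed setup for illustration and may differ from the paper's actual training procedure.

```python
import random

def train_speaker_on_listener_feedback(speaker, listener, tasks, epochs=3):
    """Hypothetical sketch: reward the speaker when the frozen listener
    resolves its expression to the intended object.

    Illustrates the idea of using communicative success as the learning
    signal; not the paper's exact algorithm.
    """
    for _ in range(epochs):
        random.shuffle(tasks)
        for task in tasks:
            # Sample a candidate referring expression from the speaker.
            expression = speaker.sample_expression(task.target_object, view=task.speaker)
            # Check whether the listener resolves it to the intended object.
            guess = listener.resolve(expression, view=task.listener)
            reward = 1.0 if guess == task.target_object else 0.0
            # Reinforce expressions that led to successful reference resolution.
            speaker.update(expression, reward)
```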
The research highlights the importance of developing AI systems that understand language in context, including the ability to take different perspectives and to link language to real or virtual environments. Future research in this area could focus on multimodal models that combine different types of information, such as visual and linguistic data, to enable a deeper understanding of language and its relation to the world.
The development of such AI systems is an important step towards more natural and effective human-machine interaction and opens up new possibilities in areas such as robotics, virtual assistants and autonomous systems.