Imagine teaching robots to perform complex tasks just by explaining them in plain English. In some ways, it's about reimagining our relationship with technology: machines that understand the context and cultural significance behind their tasks, AI guided by the rich, nuanced language of human expression, technology that becomes more accessible and fosters a deeper connection between humans and machines!
Juan Rocamonde and his team at FAR AI recently published a research paper on using Vision-Language Models (VLMs) as Reward Models (RMs). Specifically, they use natural language with CLIP (Contrastive Language–Image Pre-training, the OpenAI vision-language model released alongside DALL-E) to instruct humanoid robots in a virtual world to perform tasks like kneeling, doing the splits, and assuming a yoga lotus position.
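To make the idea concrete, here is a minimal sketch of a CLIP-based reward, assuming the Hugging Face `transformers` CLIP implementation and a simulator that can render frames. It illustrates the general approach of scoring rendered frames against a natural-language goal, not the authors' exact pipeline; the checkpoint name and the `render_frame()` helper are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any public CLIP checkpoint works the same way; this one is an example choice.
CHECKPOINT = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(CHECKPOINT)
processor = CLIPProcessor.from_pretrained(CHECKPOINT)
model.eval()


@torch.no_grad()
def clip_reward(frame: Image.Image, goal_text: str) -> float:
    """Score how well a rendered frame matches a natural-language goal.

    A generic CLIP-similarity reward: embed the frame and the goal text,
    then use their cosine similarity as the per-step reward signal.
    """
    inputs = processor(text=[goal_text], images=frame,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # Normalize so the dot product is a cosine similarity in [-1, 1].
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()


# Usage: reward the agent for frames that look like the described pose.
# `render_frame()` stands in for however your simulator produces an image.
# reward = clip_reward(render_frame(), "a humanoid robot kneeling")
```

An RL agent trained against this reward never sees a hand-engineered objective; the task is specified entirely by the sentence passed to `clip_reward`.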
Preparing the blog post and promoting this research on social media presented a unique set of challenges and learnings. Surprisingly, I found digesting and understanding the technical aspects to be the easier part; the concepts, while straightforward, are undeniably captivating. The greater challenge was my lack of experience with social media, particularly with tweeting (or should I say, posting on X). How long is a post supposed to be? Whom do we tag? How do we structure it for maximum engagement? It turns out that mastering these elements matters in the machine learning community, where many researchers rely on the platform for the latest updates. Despite my initial reservations, I'm now engaging there with a mix of pride and bewilderment, marking my unexpected entry into the world of social media.
Discover the details of their fascinating work in the explainer blog and the accompanying Twitter thread. This work showcases the potential for intuitive human-machine interaction and underscores the importance of accessibility and cultural context in AI development, paving the way for technology that integrates more seamlessly into our lives through natural language.