
Generative AI and Robotics: Streamlining Human-Robot Interaction with Jinny

December 21, 2023 ...

The year 2023 has witnessed an escalating Generative AI (Gen-AI) movement, marked by the release and upgrade of Large Language Models (LLMs) and related products including, but not limited to, OpenAI GPT-4, Meta Llama 2, Anthropic Claude, Mistral AI Mistral 7B, Microsoft Copilot, and Google Gemini. The dawn of Generative AI and robotics is here: machines, empowered with contextual conversation capabilities, have the potential to interact with humans more naturally than ever before.

This trend traces back to MIT's ELIZA human-machine communication experiment in the mid-to-late 1960s and continued through natural language processing research from the 1990s to the 2010s. The proposal of the Transformer attention mechanism in 2017, together with significant gains in silicon computing power, accelerated it further and led to today's adoption of state-of-the-art LLMs [The history, timeline, and future of LLMs; A Timeline of Large Language Model Innovation; GitHub – hollobit/GenAI_LLM_timeline: ChatGPT, GenerativeAI and LLMs Timeline; Timeline of Transformer Models / Large Language Models (AI / ML / LLM)].

While celebrating the trend, we are excited about the potential of using LLMs to advance our robotics engineering work at Fresh.

The gap in human-robot interactions

Robotics engineering has historically relied on deterministic logic handling that is based on structured data for perception, navigation, actions, and task orchestration. The underlying engineering domain spans quite a few verticals: mechanical engineering, electrical engineering, software engineering, operations research, and human-machine interaction.

While effective for development, this multi-domain engineering approach lacks the natural human-robot interaction that we expect as the field becomes more sophisticated.

As robots become more integrated into our lives, users will expect the same conversational interactions with robots that they are used to elsewhere: unstructured and intuitive. From a usability perspective, we hope to raise the robot’s level of contextual understanding and lower the barrier to use, so that users do not need programming or command-line handling of structured data.

This is where Generative AI and robotics come in: helping us bridge the gap between unstructured and structured data and making human-robot interactions smoother. Whether we’re building libraries for robot perception, creating ML Ops workflows in the robotics space, or defining standards for fleet-wide robot communication, Generative AI and robotics can elevate our work and our thinking to a more sophisticated and human-centric level.

Introducing Jinny, leveraging Generative AI and robotics to bridge the communication gap

We set out to make human-robot interactions more user-friendly using Gen-AI.

We started with a small and short proof of concept sprint: a hackathon that ran tirelessly for 2.5 days with 4 brains and 4 pairs of hands to build Jinny, a mobile robot with a ChatGPT-embedded and speech-enabled web app as its agent to handle the unstructured and structured data exchanges (see the illustration below for details).

The four hackathon teammates were Elisha Terada, Johnny Rodriguez, Michael Weller, and me. Together, we bring decades of experience in full stack software development, robotics engineering, AI/ML, innovation research, and emerging technology adoption.

An image showing the components of Jinny's technology architecture, pairing Generative AI and robotics.

Jinny’s body was a Makeblock mBot2 educational robot that can move, turn, avoid obstacles, make sounds, and flash LEDs through its companion Python SDK. We built a Microdot MicroPython HTTP REST API server on the mBot2 microcontroller to receive commands from the web app, control the mBot2 accordingly, and return the mBot2’s status to the web app.
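As a rough illustration of the command-handling side, the sketch below shows how REST-style requests could be mapped onto robot actions. It is not the hackathon code: the `FakeRobot` class, the route names (`/move`, `/turn`, `/led`, `/status`), and the JSON field names are all assumptions made for illustration. The handler is a plain function so the example runs without a robot; on the device, the same logic would sit behind Microdot routes and call the mBot2 Python SDK instead of `FakeRobot`.

```python
import json


class FakeRobot:
    """Hypothetical stand-in for the mBot2 SDK; it only records state."""

    def __init__(self):
        self.heading_deg = 0
        self.distance_cm = 0
        self.led = (0, 0, 0)

    def move(self, distance_cm):
        self.distance_cm += distance_cm

    def turn(self, angle_deg):
        self.heading_deg = (self.heading_deg + angle_deg) % 360

    def set_led(self, r, g, b):
        self.led = (r, g, b)

    def status(self):
        return {
            "heading_deg": self.heading_deg,
            "distance_cm": self.distance_cm,
            "led": list(self.led),
        }


def handle_request(robot, method, path, body=""):
    """Dispatch one request to the robot and return (status_code, payload)."""
    try:
        args = json.loads(body) if body else {}
    except ValueError:
        return 400, {"error": "body is not valid JSON"}

    try:
        if (method, path) == ("GET", "/status"):
            pass  # nothing to do; the status is returned below
        elif (method, path) == ("POST", "/move"):
            robot.move(int(args["distance_cm"]))
        elif (method, path) == ("POST", "/turn"):
            robot.turn(int(args["angle_deg"]))
        elif (method, path) == ("POST", "/led"):
            robot.set_led(int(args["r"]), int(args["g"]), int(args["b"]))
        else:
            return 404, {"error": "unknown endpoint"}
    except (KeyError, TypeError, ValueError):
        return 400, {"error": "missing or invalid argument"}

    # Every successful call reports the robot's status back to the caller.
    return 200, robot.status()
```

Returning the status from every successful call mirrors the idea that the web app both commands the robot and hears back about its state in a single round trip.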

The web app ran on a Google Pixel 6a Android phone as Jinny’s mind, using a React UI framework and the following three key software pieces to perform the agent functions:

The ChatGPT (GPT-3.5) function calling feature to generate the appropriate API calls to the mBot2, based on the given prompts and the available mBot2 API endpoints
Speechly for speech recognition and natural language understanding of voice input
ElevenLabs for AI voice generation to produce human-like speech output

The web app provided a natural language interface that bridged the unstructured semantics of a user’s voice and the structured data required on the robot side, resulting in robot actions and status feedback in synthesized speech.

The key element in this natural language interface is LLM in-context learning (in this case, the ChatGPT function calling feature), which generates a JSON object containing the arguments for calling an API endpoint on the mBot2 microcontroller. Function calling lets developers get structured data back from the LLM more reliably by combining appropriate prompt engineering with a set of primitive API function definitions, providing a new way to connect GPT (Generative Pre-trained Transformer) capabilities with external tools and APIs.
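To make the idea concrete, here is a minimal sketch of the two halves of function calling: the function definitions handed to the model, and the code that turns the model's reply into a robot API call. The definitions follow the general name/description/JSON-Schema shape used for function calling, but the function names, parameters, and the `ROBOT_BASE_URL` address are hypothetical and not taken from Jinny. The model's reply is simulated as a dictionary holding a function name and a JSON-encoded argument string, so the example runs without network access.

```python
import json

# Hypothetical robot address; a real deployment would use its own.
ROBOT_BASE_URL = "http://192.168.4.1"

# Primitive function definitions that would be sent along with the prompt.
FUNCTIONS = [
    {
        "name": "move",
        "description": "Drive forward (positive) or backward (negative).",
        "parameters": {
            "type": "object",
            "properties": {"distance_cm": {"type": "integer"}},
            "required": ["distance_cm"],
        },
    },
    {
        "name": "turn",
        "description": "Rotate in place; positive angles turn right.",
        "parameters": {
            "type": "object",
            "properties": {"angle_deg": {"type": "integer"}},
            "required": ["angle_deg"],
        },
    },
]


def to_api_call(function_call):
    """Turn a model-produced function call into (method, url, json_body).

    Raises ValueError if the call does not match the definitions above,
    so that malformed model output never reaches the robot.
    """
    name = function_call["name"]
    spec = next((f for f in FUNCTIONS if f["name"] == name), None)
    if spec is None:
        raise ValueError("unknown function: %r" % name)

    args = json.loads(function_call["arguments"])  # arguments arrive as a JSON string
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")

    schema = spec["parameters"]
    for key in schema["required"]:
        if key not in args:
            raise ValueError("missing argument: %r" % key)
    for key, value in args.items():
        if key not in schema["properties"]:
            raise ValueError("unexpected argument: %r" % key)
        # All parameters in this sketch are integers (bool is rejected).
        if type(value) is not int:
            raise ValueError("argument %r must be an integer" % key)

    return "POST", ROBOT_BASE_URL + "/" + name, args
```

For example, a simulated reply of `{"name": "turn", "arguments": "{\"angle_deg\": 90}"}` maps to a POST to the `/turn` endpoint with the body `{"angle_deg": 90}`. Validating before sending is a design choice of this sketch: the model usually returns well-formed arguments, but nothing guarantees it.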

A live demo to showcase the power of joining Generative AI and robotics

Jinny is a user-friendly robot agent with embedded Gen-AI functionality and an end-to-end user experience. It can receive human voice input, understand its semantics, and either execute robot actions or provide contextual information in response. It then renders synthesized speech and abstract visual UI output to produce a smooth human-robot interaction flow.

We prepared a short demo video to showcase the streamlined natural language processing capabilities we developed for Jinny during the hackathon. They include:

General context Q&A flow using ChatGPT with long conversation memory tactics for more natural interaction
Robot action control with ChatGPT-generated function calls to move, turn, make sounds, and flash LEDs
Speech recognition and natural language understanding using Speechly
Speech synthesis using ElevenLabs’s AI generative audio
Visual UI with informative yet abstract icon designs to facilitate the user experience
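One common conversation memory tactic, shown here as a generic illustration rather than the exact approach used in Jinny, is a sliding window: keep the system prompt, then as many of the most recent messages as fit within a budget. The `trim_history` name and the character-based budget are assumptions for this sketch; a production app would more likely count tokens.

```python
def trim_history(messages, max_chars=2000):
    """Keep system messages plus the newest other messages that fit in max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # The system prompt is always kept, so it is charged to the budget first.
    budget = max_chars - sum(len(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget:
            break  # stop at the first message that no longer fits
        kept.append(message)
        budget -= cost

    return system + kept[::-1]  # restore chronological order
```

Calling `trim_history` on the running message list before each request keeps the prompt bounded while preserving the most recent turns, which is what makes follow-up questions feel continuous.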

Creating an intuitive human-robot interaction pipeline

The core of the hackathon project was the creation of a semantic-level human-robot interaction pipeline, which not only served as the bridge between users and robots but also established an accessible platform for continued development of human-robot interactions.

We’ve shown that these solutions are functional, providing a more user-friendly and intuitive experience than conventional practices. An ordinary user is able to interact with Jinny without prior robotics engineering training. This is a small step toward better usability but a big step in how we improve the human-robot interaction workflow with the help of Gen-AI.

This is just the beginning of our Gen-AI trials. Using the Jinny platform as a starting point, we’re eager to expand our experiments to more capable robots and more complex logic, further building our expertise in using Gen-AI for robotics work.

For more information on Fresh’s approach to Generative AI and AI product development, we invite you to explore our podcast episode on the topic (Apple / Spotify / YouTube) and Brancher.ai, a Fresh-built platform that enables users to create AI apps in minutes without code.