- VLMs for Robots: Analyzing Limitations of Zero-Shot VLM-Based Robot Control on a Tiago Pal Robot
Given a high-level textual instruction to complete a task, can we use Vision-Language Models (VLMs) to control a robot in a zero-shot setting, without motion planning or fine-tuning?
Current vision-language models show promising capabilities in understanding visual scenes and generating natural language responses. However, their efficacy in direct robot control remains understudied, particularly regarding:
- Spatial reasoning accuracy
- Safety-critical decision making
- Hallucination prevalence in control commands
- Real-world applicability without specialized training
Understanding VLM limitations in robotics is crucial for:
- Establishing baseline performance metrics
- Identifying key challenges in vision-language-action alignment
- Informing future architectures for robot control
- Quantifying safety risks in VLM-based control systems
Hypothesis: VLMs can effectively control robots in a zero-shot setting, but face significant challenges in doing so. The system built to test this consists of:
- VLM-based controller using LLaVA model
- Real-time vision feed integration
- LIDAR-based safety validation
- ROS-based robot control interface
- Hierarchical task decomposition:
  - High-level goal interpretation
  - Subtask generation
  - Primitive action execution
High-level goal interpretation / Subgoal Generator
- Takes high-level task description and initial scene image
- Outputs ordered sequence of subtasks
- Uses LLaVA model with task decomposition prompt
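A minimal sketch of what this subgoal generator could look like, assuming the `ollama` Python client and the llava-llama3 model; the prompt wording, function name, and parsing of the numbered list are illustrative, not the project's exact code.

```python
# Sketch of the subgoal generator (prompt and parsing are illustrative).
import ollama

DECOMPOSITION_PROMPT = (
    "You control a mobile robot. Task: {task}\n"
    "Look at the attached image of the initial scene and break the task into "
    "a short, ordered list of subtasks, one per line, numbered 1., 2., 3., ..."
)

def generate_subtasks(task, initial_image_path):
    """Ask the VLM to decompose a high-level task into ordered subtasks."""
    response = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": DECOMPOSITION_PROMPT.format(task=task),
            "images": [initial_image_path],
        }],
    )
    text = response["message"]["content"]
    # Keep only numbered lines, e.g. "1. Turn left until the chair is visible"
    return [line.split(".", 1)[1].strip()
            for line in text.splitlines()
            if line.strip()[:1].isdigit() and "." in line]
```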
Robot Control Model
- Receives current subtask and scene images
- Generates discrete robot actions (move forward, turn left/right)
- Considers:
  - Initial vs. current state comparison
  - Previous actions and feedback
  - Safety constraints
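A sketch of one control step under these constraints, assuming `rospy`, the `ollama` Python client, and Tiago's `/mobile_base_controller/cmd_vel` velocity topic (verify the topic on your setup); the prompt text, helper names, and velocity values are illustrative. The action set matches the limited forward/backward/left/right/stop space described under the limitations below.

```python
# Sketch of the robot control step (names, prompt, and velocities are assumptions).
import ollama
import rospy
from geometry_msgs.msg import Twist

ACTIONS = {                      # discrete action space -> (linear x, angular z)
    "move forward": (0.2, 0.0),
    "move backward": (-0.2, 0.0),
    "turn left": (0.0, 0.5),
    "turn right": (0.0, -0.5),
    "stop": (0.0, 0.0),
}

CONTROL_PROMPT = (
    "Current subtask: {subtask}\n"
    "Previous action: {prev}. Feedback: {feedback}\n"
    "Compare the attached initial and current images and reply with exactly "
    "one of: " + ", ".join(ACTIONS)
)

def next_action(subtask, prev, feedback, initial_img, current_img):
    """Ask the VLM for the next discrete action; fall back to 'stop' if invalid."""
    reply = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": CONTROL_PROMPT.format(subtask=subtask, prev=prev, feedback=feedback),
            "images": [initial_img, current_img],
        }],
    )["message"]["content"].strip().lower()
    return reply if reply in ACTIONS else "stop"   # invalid/hallucinated command -> stop

def execute(action, pub, duration=1.0):
    """Publish the velocity for a discrete action for a fixed duration, then stop."""
    cmd = Twist()
    cmd.linear.x, cmd.angular.z = ACTIONS[action]
    end = rospy.Time.now() + rospy.Duration(duration)
    rate = rospy.Rate(10)
    while rospy.Time.now() < end and not rospy.is_shutdown():
        pub.publish(cmd)
        rate.sleep()
    pub.publish(Twist())
```

Here `pub` would be a `rospy.Publisher('/mobile_base_controller/cmd_vel', Twist, queue_size=1)` created after `rospy.init_node(...)`.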
Feedback Model
- Compares initial, previous and current frames
- Evaluates subtask completion
- Provides corrective suggestions
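A sketch of the feedback step, again assuming the `ollama` Python client; the prompt and the yes/no parsing convention are illustrative.

```python
# Sketch of the feedback model (prompt and parsing are assumptions).
import ollama

FEEDBACK_PROMPT = (
    "Subtask: {subtask}\n"
    "Images attached in order: initial frame, previous frame, current frame.\n"
    "Answer 'yes' or 'no': is the subtask complete? "
    "If 'no', add one short corrective suggestion for the next action."
)

def evaluate_subtask(subtask, initial_img, previous_img, current_img):
    """Return (done, suggestion) based on the VLM's comparison of the three frames."""
    reply = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": FEEDBACK_PROMPT.format(subtask=subtask),
            "images": [initial_img, previous_img, current_img],
        }],
    )["message"]["content"].strip()
    done = reply.lower().startswith("yes")
    return done, reply
```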
Safety and Evaluation
- LIDAR-based collision prevention
- Command validation pipeline
- Real-time safety checks
- Performance comparison of llava:13b vs. llava:8b, to see whether the larger model is better
- Emergency stop capabilities
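A minimal sketch of how the LIDAR safety layer and command validation can fit together, assuming `rospy` and a `/scan` LaserScan topic; the topic name and the 0.4 m threshold are assumptions.

```python
# Sketch of the LIDAR safety layer (topic name and threshold are assumptions).
import math
import rospy
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

SAFETY_DISTANCE = 0.4   # metres; stop if anything is closer than this

class SafetyMonitor(object):
    def __init__(self, cmd_pub):
        self.cmd_pub = cmd_pub
        self.blocked = False
        rospy.Subscriber("/scan", LaserScan, self.on_scan)

    def on_scan(self, scan):
        ranges = [r for r in scan.ranges if not math.isinf(r) and not math.isnan(r)]
        self.blocked = bool(ranges) and min(ranges) < SAFETY_DISTANCE
        if self.blocked:
            self.cmd_pub.publish(Twist())   # emergency stop: zero velocity

    def validate(self, action):
        """Command validation: veto forward motion while an obstacle is too close."""
        if self.blocked and action == "move forward":
            return "stop"
        return action
```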
Metrics
- Command validity rate: number of valid commands generated by the VLM
- Safety intervention frequency: number of safety stops triggered by the LIDAR system
- Task completion success: number of successful task completions vs. human-observed completions
- Hallucination detection rate: number of hallucinated commands
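These metrics are simple ratios over counts logged during a run; the function and argument names below are illustrative.

```python
# How logged counts could translate into the reported rates (names illustrative).
def rates(total_cmds, valid_cmds, safety_stops, completed, human_observed, hallucinated):
    return {
        "command_validity_rate": valid_cmds / total_cmds,
        "safety_intervention_frequency": safety_stops / total_cmds,
        "task_completion_success": completed / human_observed,
        "hallucination_rate": hallucinated / total_cmds,
    }
```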
Object Location
prompt: find {object} on your {left/right}
Metrics
- Success rate over 10 trials: determined by the number of successful completions (a completion counts as successful when the robot locates the object in the center of the room)
- Hallucination rate: determined by the number of hallucinated commands
Navigation to Point
prompt: move to the location where {object} is located
Metrics
- Final position error: determined by the distance between the robot's final position and the target location
- Path smoothness (direction changes)
- Collision avoidance success rate: determined by the number of robot actions that trigger the LIDAR collision-avoidance system
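A small sketch of how the first two navigation metrics could be computed, assuming 2D poses as (x, y) tuples and a log of executed discrete actions; the function names are illustrative.

```python
# Sketch of the navigation metrics (pose format and names are assumptions).
import math

def final_position_error(final_pose, target_pose):
    """Euclidean distance between the final pose and the target, in metres."""
    return math.hypot(final_pose[0] - target_pose[0], final_pose[1] - target_pose[1])

def direction_changes(actions):
    """Path smoothness proxy: count transitions between different non-stop actions."""
    moves = [a for a in actions if a != "stop"]
    return sum(1 for prev, cur in zip(moves, moves[1:]) if prev != cur)
```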
Obstacle Avoidance
prompt: move to a {location} where you see an {object} while avoiding the chairs
Metrics
- Number of safety stops
- Success rate in different configurations
Key challenges identified:
- Non-realtime control loop
- Limited action space: in order to keep the robot safe, the action space is limited to forward, backward, left, right, and stop
- Binary completion assessment
- No learning/adaptation mechanism
- Simplified spatial representation
- Reliance on prompt engineering
Hardware:
- Tiago Pal robot
- ROS Kinetic
- Ubuntu 20.04
Software Dependencies:
- Python 3.8+
- ROS Noetic (simulation)
- Tiago Pal simulation packages
- Flask
- Ollama and LLaVA model
Install ROS Noetic on your system following the instructions from the official ROS website.
https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/InstallUbuntuAndROS
Install the Tiago Pal robot simulation packages by following the instructions from the official ROS website:
https://wiki.ros.org/Robots/TIAGo/Tutorials/Installation/Testing_simulation
Launch the Tiago Pal simulation:
roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true
To install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Install the necessary Python packages:
pip install ollama
This project currently uses only llava-llama3 (though in theory you can swap in another capable VLM); code for using Gemini 1.5 as the VLM has been temporarily removed. To pull the model:
ollama run llava-llama3
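Once the model is pulled, a quick way to check it from Python (the image path is a placeholder):

```python
# Quick sanity check that llava-llama3 answers a visual question via the ollama client.
import ollama

response = ollama.chat(
    model="llava-llama3",
    messages=[{
        "role": "user",
        "content": "Describe what the robot's camera sees in one sentence.",
        "images": ["/tmp/current_frame.jpg"],   # placeholder image path
    }],
)
print(response["message"]["content"])
```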
Running the Simulation and Flask App
Launch the ROS simulation:
cd ~/tiago_public_ws
source devel/setup.bash
roslaunch tiago_gazebo tiago_gazebo.launch public_sim:=true
Start the Flask app:
export FLASK_APP=app.py
flask run
Open your web browser and navigate to http://127.0.0.1:5000 to access the control interface.
Use the web interface to input commands. The Flask app will process these commands using Ollama and the LLaVA model.
The robot will execute the commands, providing feedback on each action.
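For orientation, a minimal sketch of the shape of a command endpoint like the one in app.py; the route name, form field, prompt, and response format are illustrative and not the project's actual code.

```python
# Minimal sketch of a Flask command endpoint (route, field, and prompt are assumptions).
import ollama
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/command", methods=["POST"])
def command():
    task = request.form.get("task", "")
    # Ask the VLM to decompose the task; the real app would hand the
    # resulting subtasks to the ROS control loop.
    response = ollama.chat(
        model="llava-llama3",
        messages=[{
            "role": "user",
            "content": "Break this robot task into numbered subtasks: " + task,
        }],
    )
    return jsonify({"task": task, "subtasks": response["message"]["content"]})

if __name__ == "__main__":
    app.run(debug=True)
```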