
Vision Language-based Embodied Control through Interactive Reasoning

This project demonstrates how Vision Language Models (VLMs) can enhance robotic systems through vision understanding and natural language interaction. It implements a function-calling interface that connects VLMs to robotic perception/control modules for two core tasks: Room-to-Room Navigation (R2R) and Embodied Question Answering (EQA). The system is evaluated in AI2-THOR simulation, showing how foundation models can improve robotic perception and planning.
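The function-calling interface described above can be sketched as a small dispatch loop: the VLM emits structured tool calls, and a registry maps each call to a robot primitive. All names here (`move_agent`, `capture_frame`, `TOOL_REGISTRY`) are illustrative placeholders, not the repository's actual API.

```python
# Minimal sketch of a function-calling loop between a VLM and robot
# primitives. Names are illustrative, not the repository's actual API.
from typing import Any, Callable, Dict

def move_agent(direction: str) -> str:
    """Placeholder for an AI2-THOR motion primitive."""
    return f"moved {direction}"

def capture_frame() -> str:
    """Placeholder for grabbing the current egocentric camera frame."""
    return "frame_0001.png"

# Registry exposed to the model as its available tools.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "move_agent": move_agent,
    "capture_frame": capture_frame,
}

def dispatch(tool_call: Dict[str, Any]) -> Any:
    """Route a model-emitted tool call to the matching robot function."""
    fn = TOOL_REGISTRY[tool_call["name"]]
    return fn(**tool_call.get("args", {}))

# A VLM with function calling would emit structured calls like:
print(dispatch({"name": "move_agent", "args": {"direction": "forward"}}))
```

In practice the perception/control modules replace these stubs, and the dispatch result is fed back to the model as the tool response for the next reasoning step.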

(Figure: system overview)

Demo

EQA_succ_fast_handbrake.mp4

Getting Started

Prerequisites

  • Python 3.9+ (Python 3.11.5 was used for development).

Installation

Clone the repository:

git clone https://github.com/tommasoTubaldo/Application_of_VLMs_in_Robotics.git
cd Application_of_VLMs_in_Robotics

Install dependencies:

pip install -r requirements.txt
pip install -U google-genai

Note: Use a virtual environment (python -m venv venv, then source venv/bin/activate) to avoid dependency conflicts.
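Putting the note and the install steps together, a typical setup on bash/zsh looks like this (the environment name `venv` is just a convention):

```shell
# Create and activate an isolated environment, then install dependencies.
python3 -m venv venv             # create the virtual environment
source venv/bin/activate         # activate it (bash/zsh)
pip install -r requirements.txt  # project dependencies
pip install -U google-genai      # Google Gen AI SDK
```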

Configuration

Choose either Vertex AI (recommended) or Gemini API:

Option 1: Vertex AI (Recommended)

New Google Cloud accounts include $300 of credit usable with Vertex AI services during a 90-day free trial.

  1. Set up a Google Cloud project:

  2. Install the Google Cloud CLI:

    • Install the Google Cloud SDK.
    • Authenticate and log in:
      gcloud auth application-default login
  3. Configure environment variables:

    • Run these commands in your project directory:

      export PROJECT_ID="<your_project_id>"
      export LOCATION="<your_location>"
      export API_MODE="vertex"
    • You can find the project ID in the Google Cloud Console by following these instructions.

    • You can choose the location by referring to Vertex AI regions.

Option 2: Gemini API (Simpler, but rate-limited)

  1. Get an API key:

  2. Configure environment:

    • In your project directory:
      cd ~/your_project
      export GEMINI_API_KEY="<your_api_key>"
      export API_MODE="gemini"
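The environment variables set in either option above could be consumed along these lines. Only the variable names come from this README; the helper function and its structure are an illustrative sketch of how `API_MODE` might select the backend for the google-genai client.

```python
# Sketch: build client keyword arguments from the environment variables
# configured above. The helper name is illustrative, not the project's API.
import os

def make_client_kwargs() -> dict:
    """Select Vertex AI or Gemini API settings based on API_MODE."""
    mode = os.environ.get("API_MODE", "gemini")
    if mode == "vertex":
        return {
            "vertexai": True,
            "project": os.environ["PROJECT_ID"],
            "location": os.environ["LOCATION"],
        }
    return {"api_key": os.environ["GEMINI_API_KEY"]}

# With google-genai installed, the kwargs feed straight into the client:
# from google import genai
# client = genai.Client(**make_client_kwargs())
```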

Run the project

From the project root, run:

python3 main.py

About

This work explores the application of VLMs to mobile robotics, with a focus on high-level embodied tasks such as Visual-Language Navigation (VLN) and Embodied Question Answering (EQA).
