Multimodal AI Assistant

A fully featured (I would like to think so) AI companion (or a poor man's version of an AI girlfriend) that sees, speaks, and creates, all running locally on your machine.

What is This?

Ever wanted an AI assistant whose personality you can set and then interact with? One that you can talk to naturally, show images to, and have generate visuals in response? This multimodal AI assistant brings together the best of modern (whatever that means, given my limited understanding) AI capabilities into one seamless (provided you have a high-end PC) experience.

Note: This project is being vibe coded (including this README), built organically and iteratively as ideas flow.

Key Features

Natural Conversations

  • Speak naturally using your microphone—no typing required
  • Multiple input modes: voice, text, or text + images
  • Real-time voice responses with customizable TTS engines
  • Phone call mode with voice activity detection (VAD) for smart turn-taking

Visual Understanding

  • Show it an image and describe what you want
  • Powered by local vision-language models for privacy
  • Understands context from both your words and images

Image Generation

  • Creates images based on conversation context
  • The AI can autonomously decide when to generate visuals
  • Uses local Stable Diffusion for complete privacy

Privacy-First

  • Everything runs locally by default, with no cloud dependencies required
  • Your conversations and images never leave your machine
  • Optional cloud LLM support for those who don't have powerful hardware

How It Works

  1. Input → Talk, type, or share images with the assistant
  2. Understanding → Vision models describe images, speech is transcribed
  3. Thinking → Your chosen LLM processes everything as natural conversation
  4. Response → Get spoken responses and generated images in real-time
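Under the hood, each turn follows that pipeline. Here is a minimal sketch in Python of one turn; the callables and the image-generation trigger are illustrative placeholders, not this project's actual API:

def handle_turn(stt, vlm, llm, tts, imagegen, audio=None, text=None, image=None):
    # One conversation turn; the five callables stand in for the real engines.
    prompt = text if text is not None else stt(audio)       # transcribe speech (Whisper)
    if image is not None:
        prompt += "\n[User shared an image] " + vlm(image)   # describe the image (vision model)
    reply = llm(prompt)                                       # the LLM reasons over everything
    tts(reply)                                                # speak the response
    if "<generate_image>" in reply:                           # illustrative trigger for visuals
        imagegen(reply)
    return reply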

Tech Stack

  • Speech-to-Text: Faster Whisper (local transcription)
  • Vision Understanding: Qwen3-VL 2B/4B (local vision-language model)
  • Language Model: Ollama (supports local and cloud deployment)
  • Image Generation: Stable Diffusion / Qwen Image Edit
  • Text-to-Speech: Multiple engines (Piper, Chatterbox, Soprano)
  • Frontend: React + TypeScript + Vite
  • Backend: FastAPI + WebSockets

Quick Start

1. Clone the Repository

git clone https://github.com/ravipurohit1991/multimodal-ai-assistant.git
cd multimodal-ai-assistant

2. Install Dependencies

# Install Python dependencies
pip install -e .

# Frontend is automatically built during pip install
# If you need to rebuild manually:
cd frontend
npm install
npm run build
cd ..

3. Configure Environment Variables

Copy the example environment file and configure it:

# Windows
copy src\aiassistant\.env.example src\aiassistant\.env

# Linux/Mac
cp src/aiassistant/.env.example src/aiassistant/.env

Edit src/aiassistant/.env with your preferred settings. See Configuration Guide below for details.

4. Download Required Models

See Model Setup section for detailed instructions on downloading and configuring models.

5. Run the Application

# Start the backend server
python -m aiassistant.app

# The application will be available at http://localhost:8000

Configuration Guide

All configuration is done through environment variables in the .env file. Here are the key settings:

Low VRAM Mode

LOW_VRAM_MODE=true  # Unloads models after use to save memory
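Conceptually, low-VRAM mode frees GPU memory when a model is not in use. A generic PyTorch sketch of the idea (not this project's actual implementation):

import gc
import torch

model = torch.nn.Linear(4096, 4096).to("cuda")  # stand-in for a large model
# ... use the model ...
model = model.to("cpu")       # move the weights out of VRAM
del model                     # drop the last Python reference
gc.collect()                  # let Python collect the freed objects
torch.cuda.empty_cache()      # return cached CUDA memory to the driver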

LLM Configuration

Option 1: Local Ollama (Recommended)

LLM_HOST=http://localhost:11434
LLM_MODEL=llama3.2  # or mistral, qwen2.5, etc.
LLM_DEVICE=auto     # auto, cuda, or cpu
  1. Install Ollama from https://ollama.com/download
  2. Pull a model: ollama pull llama3.2
  3. Start Ollama service
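To confirm Ollama is reachable with these settings before starting the assistant, a quick standalone check against Ollama's chat API (independent of this project):

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,
    },
    timeout=60,
)
print(resp.json()["message"]["content"])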

Option 2: Ollama Cloud

LLM_HOST=https://ollama.com
LLM_MODEL=glm-4.7:cloud

Get an API key from https://ollama.com

Option 3: Custom Implementation

You can implement your own LLM provider by extending the base class in src/aiassistant/llm/base.py.
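As a hypothetical illustration of the shape such a provider might take (the class and method names below are invented; check base.py for the real interface):

from abc import ABC, abstractmethod

class LLMProvider(ABC):
    # Hypothetical interface; the real one lives in src/aiassistant/llm/base.py
    @abstractmethod
    def chat(self, messages: list[dict]) -> str: ...

class EchoLLM(LLMProvider):
    # Toy provider that simply echoes the last user message
    def chat(self, messages: list[dict]) -> str:
        return messages[-1]["content"]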

Speech-to-Text (Whisper)

WHISPER_MODEL=distil-medium.en  # or tiny.en, base.en, small.en, medium.en, large-v3
WHISPER_DEVICE=cuda             # cuda or cpu
WHISPER_COMPUTE=float16         # float16, float32, or int8

Models are automatically downloaded from HuggingFace on first run.
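These settings map directly onto faster-whisper's constructor. A quick standalone sanity check (assuming you have a sample.wav to transcribe):

from faster_whisper import WhisperModel

model = WhisperModel("distil-medium.en", device="cuda", compute_type="float16")
segments, info = model.transcribe("sample.wav")
print(" ".join(segment.text for segment in segments))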

Text-to-Speech

Option 1: Piper TTS (Default - Fast & Lightweight)

TTS_ENGINE=piper
PIPER_USE_CUDA=true

Download voices from: https://huggingface.co/rhasspy/piper-voices/tree/main

Place the .onnx and .onnx.json files in: src/models/voices/pipertts/

Option 2: Chatterbox TTS (Expressive)

TTS_ENGINE=chatterbox
CHATTERBOX_DEVICE=cuda

Requires: pip install chatterbox-tts

Option 3: Soprano TTS (Fast & Lightweight)

TTS_ENGINE=soprano
SOPRANO_DEVICE=cuda

Requires: pip install soprano-tts

Image Generation

IMAGEGEN_ENABLED=true
IMAGEGEN_MODEL=prompthero/openjourney  # HuggingFace model ID or local path
IMAGEGEN_DEVICE=cuda
IMAGEGEN_WIDTH=512    # Lower for less VRAM (512x512 = ~6GB VRAM)
IMAGEGEN_HEIGHT=512
IMAGEGEN_STEPS=30     # 20-30 for speed, 40-50 for quality

Popular models:

  • prompthero/openjourney - Fast, small VRAM
  • runwayml/stable-diffusion-v1-5 - General purpose
  • stabilityai/stable-diffusion-2-1 - Better quality

Models are downloaded from HuggingFace on first run, or you can point to a local directory.
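These settings correspond to a standard Hugging Face diffusers text-to-image pipeline. A rough standalone equivalent of the defaults above (not the project's code):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "prompthero/openjourney", torch_dtype=torch.float16
).to("cuda")
image = pipe(
    "a cozy cabin in a snowy forest at dusk",
    width=512, height=512, num_inference_steps=30,
).images[0]
image.save("output.png")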

ComfyUI Integration (Coming Soon)

You can implement ComfyUI endpoints by extending src/aiassistant/imagegen/base.py.
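The project-side interface lives in base.py; the ComfyUI side mostly amounts to queuing a workflow through ComfyUI's HTTP API. A minimal sketch of that half, assuming a ComfyUI instance on its default port and a workflow exported in API format:

import json
import urllib.request

def queue_workflow(workflow: dict, host: str = "http://127.0.0.1:8188") -> dict:
    # Queue an API-format workflow on a running ComfyUI instance
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/prompt", data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # includes a prompt_id for tracking the job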

Image Explainer (Vision-Language Model)

IMAGEEXPLAINER_ENABLED=true
IMAGEEXPLAINER_MODEL=huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated
IMAGEEXPLAINER_DEVICE=auto

Models:

  • huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated - 2B params, lower VRAM
  • huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated - 4B params, better quality

Models are downloaded from HuggingFace on first run.

Model Setup

Directory Structure

Models are stored in src/models/:

src/models/
├── image_explainer/        # Vision-language models (auto-downloaded)
├── image_generation/       # Diffusion models (auto-downloaded)
├── stt/                   # Whisper models (auto-downloaded)
├── tts/                   # TTS models (auto-downloaded)
└── voices/
    └── pipertts/          # Piper voice files (manual download)

Piper Voice Setup

  1. Visit https://huggingface.co/rhasspy/piper-voices/tree/main
  2. Choose a voice (e.g., en_US-lessac-medium)
  3. Download both files:
    • en_US-lessac-medium.onnx
    • en_US-lessac-medium.onnx.json
  4. Place in src/models/voices/pipertts/

Whisper Model Setup

Models are automatically downloaded on first use. Specify the model name in .env:

WHISPER_MODEL=distil-medium.en

Image Generation Model Setup

Option 1: Auto-download from HuggingFace

IMAGEGEN_MODEL=prompthero/openjourney

Option 2: Use local model

IMAGEGEN_MODEL=/path/to/your/local/model

Image Explainer Model Setup

Models are automatically downloaded on first use:

IMAGEEXPLAINER_MODEL=huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated

Usage

  1. Open http://localhost:8000 in your browser
  2. Click the microphone button to start voice input
  3. Speak naturally or type your message
  4. Attach images by clicking the image button
  5. The AI will respond with voice and/or images

Development

See BUILD_INSTRUCTIONS.md for detailed build system information.

Frontend Development

cd frontend
npm run dev  # Starts dev server with hot reload at http://localhost:5173

Backend Development

python -m aiassistant.app  # Runs backend server

Perfect For

  • Privacy-conscious users wanting local AI assistants
  • Anyone who wants a truly interactive AI companion

Future Plans

Live 2D/3D Character with Emotions

The ultimate vision is an animated 2D/3D character of your choice that reacts and interacts during conversations, similar to a VTuber. Using technologies like Qwen ControlNet, image-editing pipelines, PyGame, Wan, and other open-source video/image generation models, it should be possible to create a fully animated avatar that:

  • Lip-syncs to generated speech
  • Shows facial expressions and reactions based on conversation context
  • Responds with gestures and body language
  • Detects and responds to emotion in real time

This is a long-term pipe dream, but with current AI models advancing rapidly, it should become feasible soon. Think of it as a more advanced, AI-powered Microsoft Clippy (yes, I am old)! 📎

Other Planned Features

  • Support Docker
  • Multi-character support (character profiles)
  • Conversation memory and context management
  • Voice cloning for personalized TTS
  • Custom image generation styles and LoRA support
  • Export/import conversation history
  • Plugin system for custom engines

Contributing

Contributions are welcome! Ways to help include:

  • Add support for new LLM providers
  • Implement new TTS/STT engines
  • Integrate ComfyUI or other image generation backends
  • Improve the UI/UX
  • Fix bugs or add features

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Test thoroughly
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Acknowledgments

  • Claude for keeping up with my weird requests
