A fully-featured (I would like to think so) AI companion (or a poor man's version of AI Girlfriend) that sees, speaks, and creates—all running locally on your machine.
Ever wanted an AI assistant with a personality you can set and interact with? One you can talk to naturally, show images to, and have generate visuals in response? This multimodal AI assistant brings together the best of modern AI capabilities (whatever that means, given my limited understanding) into one seamless experience (provided you have a high-end PC).
Note: This project is being vibe-coded (including this README), built organically and iteratively as ideas flow.
- Speak naturally using your microphone—no typing required
- Multiple input modes: voice, text, or text + images
- Real-time voice responses with customizable TTS engines
- Phone call mode (with VAD, for smart turns)
- Show it an image and describe what you want
- Powered by local vision-language models for privacy
- Understands context from both your words and images
- Creates images based on conversation context
- The AI can autonomously decide when to generate visuals
- Uses local Stable Diffusion for complete privacy
- Everything runs locally by default; no cloud dependencies required
- Your conversations and images never leave your machine
- Optional cloud LLM support for those who don't have powerful hardware
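The phone-call mode's "smart turns" depend on voice activity detection. As an illustration only (not this project's actual implementation), a minimal energy-based VAD over 16-bit PCM frames might look like this; the threshold and frame counts are arbitrary example values:

```python
import struct

SILENCE_FRAMES = 20  # consecutive quiet frames that end the user's turn

def frame_energy(pcm_bytes: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    if not pcm_bytes:
        return 0.0
    samples = struct.unpack(f"<{len(pcm_bytes) // 2}h", pcm_bytes)
    return sum(abs(s) for s in samples) / len(samples)

class TurnDetector:
    """Declares end-of-turn after SILENCE_FRAMES consecutive quiet frames."""

    def __init__(self, threshold: float = 500.0):
        self.threshold = threshold
        self.quiet = 0
        self.speaking = False

    def feed(self, pcm_bytes: bytes) -> bool:
        """Feed one audio frame; returns True once when the turn ends."""
        if frame_energy(pcm_bytes) >= self.threshold:
            self.speaking = True
            self.quiet = 0
            return False
        if not self.speaking:
            return False  # silence before anyone spoke is not a turn end
        self.quiet += 1
        if self.quiet >= SILENCE_FRAMES:
            self.speaking = False
            self.quiet = 0
            return True
        return False
```

Real VADs (e.g. the model-based ones used for telephony) are far more robust to noise; this sketch just shows the turn-taking state machine.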
- Input → Talk, type, or share images with the assistant
- Understanding → Vision models describe images, speech is transcribed
- Thinking → Your chosen LLM processes everything as natural conversation
- Response → Get spoken responses and generated images in real-time
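The four stages above can be sketched as a single function; the stage callables here are stand-ins for the real engines (Whisper, the vision model, the LLM, and TTS), not the project's actual code:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Turn:
    """One user turn: raw audio and/or text, plus an optional image."""
    audio: Optional[bytes] = None
    text: Optional[str] = None
    image: Optional[bytes] = None

def run_turn(
    turn: Turn,
    transcribe: Callable[[bytes], str],      # Speech-to-Text stage
    describe_image: Callable[[bytes], str],  # Vision stage
    chat: Callable[[str], str],              # LLM stage
    speak: Callable[[str], bytes],           # TTS stage
) -> bytes:
    """Input -> Understanding -> Thinking -> Response, as audio bytes."""
    parts = []
    if turn.audio is not None:
        parts.append(transcribe(turn.audio))
    if turn.text is not None:
        parts.append(turn.text)
    if turn.image is not None:
        parts.append(f"[image: {describe_image(turn.image)}]")
    reply = chat("\n".join(parts))
    return speak(reply)
```

With stub callables you can see the flow: a text-plus-image turn becomes a single prompt string, and the LLM's reply is handed to TTS.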
- Speech-to-Text: Faster Whisper (local transcription)
- Vision Understanding: Qwen3VL-2B/4B (local vision-language model)
- Language Model: Ollama (supports local and cloud deployment)
- Image Generation: Stable Diffusion / Qwen Image Edit
- Text-to-Speech: Multiple engines (Piper, Chatterbox, Soprano)
- Frontend: React + TypeScript + Vite
- Backend: FastAPI + WebSockets
git clone https://github.com/yourusername/multimodal-ai-assistant.git
cd multimodal-ai-assistant

# Install Python dependencies
pip install -e .
# Frontend is automatically built during pip install
# If you need to rebuild manually:
cd frontend
npm install
npm run build
cd ..

Copy the example environment file and configure it:
# Windows
copy src\aiassistant\.env.example src\aiassistant\.env
# Linux/Mac
cp src/aiassistant/.env.example src/aiassistant/.env

Edit src/aiassistant/.env with your preferred settings. See Configuration Guide below for details.
See Model Setup section for detailed instructions on downloading and configuring models.
# Start the backend server
python -m aiassistant.app
# The application will be available at http://localhost:8000

All configuration is done through environment variables in the .env file. Here are the key settings:
LOW_VRAM_MODE=true # Unloads models after use to save memory

Option 1: Local Ollama (Recommended)
LLM_HOST=http://localhost:11434
LLM_MODEL=llama3.2 # or mistral, qwen2.5, etc.
LLM_DEVICE=auto # auto, cuda, or cpu

- Install Ollama from https://ollama.com/download
- Pull a model:
ollama pull llama3.2
- Start the Ollama service
Option 2: Ollama Cloud
LLM_HOST=https://ollama.com
LLM_MODEL=glm-4.7:cloud

Get an API key from https://ollama.com
Option 3: Custom Implementation
You can implement your own LLM provider by extending the base class in src/aiassistant/llm/base.py.
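The exact interface in src/aiassistant/llm/base.py isn't reproduced here, but a provider for any OpenAI-compatible endpoint (Ollama exposes one under /v1) would typically look something like this; the class and method names below are illustrative, not the project's real ones:

```python
import json
import urllib.request

class LLMBase:
    """Stand-in for the project's base class; the real one lives in
    src/aiassistant/llm/base.py and may differ."""
    def chat(self, messages: list[dict]) -> str:
        raise NotImplementedError

class OpenAICompatibleLLM(LLMBase):
    def __init__(self, host: str, model: str, api_key: str = ""):
        self.host = host.rstrip("/")
        self.model = model
        self.api_key = api_key

    def build_request(self, messages: list[dict]) -> urllib.request.Request:
        """Assemble the POST request without sending it (easy to unit-test)."""
        body = json.dumps({"model": self.model, "messages": messages}).encode()
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        return urllib.request.Request(
            f"{self.host}/v1/chat/completions", data=body, headers=headers
        )

    def chat(self, messages: list[dict]) -> str:
        with urllib.request.urlopen(self.build_request(messages)) as resp:
            data = json.load(resp)
        return data["choices"][0]["message"]["content"]
```

Separating request construction from the network call keeps the provider testable without a running server.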
WHISPER_MODEL=distil-medium.en # or tiny.en, base.en, small.en, medium.en, large-v3
WHISPER_DEVICE=cuda # cuda or cpu
WHISPER_COMPUTE=float16 # float16, float32, or int8

Models are automatically downloaded from HuggingFace on first run.
Option 1: Piper TTS (Default - Fast & Lightweight)
TTS_ENGINE=piper
PIPER_USE_CUDA=true

Download voices from: https://huggingface.co/rhasspy/piper-voices/tree/main
Place .onnx and .json files in: src/models/voices/pipertts/
Option 2: Chatterbox TTS (Expressive)
TTS_ENGINE=chatterbox
CHATTERBOX_DEVICE=cuda

Requires: pip install chatterbox-tts
Option 3: Soprano TTS (Fast & Lightweight)
TTS_ENGINE=soprano
SOPRANO_DEVICE=cuda

Requires: pip install soprano-tts
IMAGEGEN_ENABLED=true
IMAGEGEN_MODEL=prompthero/openjourney # HuggingFace model ID or local path
IMAGEGEN_DEVICE=cuda
IMAGEGEN_WIDTH=512 # Lower for less VRAM (512x512 = ~6GB VRAM)
IMAGEGEN_HEIGHT=512
IMAGEGEN_STEPS=30 # 20-30 for speed, 40-50 for quality

Popular models:
- prompthero/openjourney - Fast, small VRAM
- runwayml/stable-diffusion-v1-5 - General purpose
- stabilityai/stable-diffusion-2-1 - Better quality
Models are downloaded from HuggingFace on first run, or you can point to a local directory.
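Under the hood these settings map onto a diffusers text-to-image pipeline roughly like this (a sketch, with deferred heavy imports; clamp_size enforces the multiple-of-8 dimensions Stable Diffusion's VAE requires):

```python
def clamp_size(px: int, lo: int = 256, hi: int = 1024) -> int:
    """Clamp to [lo, hi] and round down to a multiple of 8, since
    Stable Diffusion's VAE works on 8-pixel latent blocks."""
    px = max(lo, min(hi, px))
    return px - px % 8

def generate(prompt: str, model: str = "prompthero/openjourney",
             width: int = 512, height: int = 512, steps: int = 30):
    import torch                                   # deferred: heavy deps
    from diffusers import StableDiffusionPipeline
    pipe = StableDiffusionPipeline.from_pretrained(
        model, torch_dtype=torch.float16).to("cuda")
    return pipe(prompt, width=clamp_size(width), height=clamp_size(height),
                num_inference_steps=steps).images[0]
```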
ComfyUI Integration (Coming Soon)
You can implement ComfyUI endpoints by extending src/aiassistant/imagegen/base.py.
IMAGEEXPLAINER_ENABLED=true
IMAGEEXPLAINER_MODEL=huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated
IMAGEEXPLAINER_DEVICE=auto

Models:
- huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated - 2B params, lower VRAM
- huihui-ai/Huihui-Qwen3-VL-4B-Instruct-abliterated - 4B params, better quality
Models are downloaded from HuggingFace on first run.
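Qwen-VL-style models take a chat payload that interleaves image and text parts. A sketch using transformers' Auto classes follows; the exact processor/model classes for Qwen3-VL may differ by transformers version, so treat the loading code as an approximation:

```python
def build_vision_messages(image_path: str, question: str) -> list[dict]:
    """Chat payload in the interleaved format Qwen-VL processors expect."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]

def describe_image(
    image_path: str,
    model_id: str = "huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated",
    question: str = "Describe this image.",
) -> str:
    # deferred: these pull multi-GB weights on first use
    from transformers import AutoModelForImageTextToText, AutoProcessor
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, device_map="auto")
    inputs = processor.apply_chat_template(
        build_vision_messages(image_path, question),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    # decode only the newly generated tokens, not the prompt
    return processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)[0]
```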
Models are stored in src/models/:
src/models/
├── image_explainer/ # Vision-language models (auto-downloaded)
├── image_generation/ # Diffusion models (auto-downloaded)
├── stt/ # Whisper models (auto-downloaded)
├── tts/ # TTS models (auto-downloaded)
└── voices/
└── pipertts/ # Piper voice files (manual download)
- Visit https://huggingface.co/rhasspy/piper-voices/tree/main
- Choose a voice (e.g., en_US-lessac-medium)
- Download both files: en_US-lessac-medium.onnx and en_US-lessac-medium.onnx.json
- Place them in src/models/voices/pipertts/
Models are automatically downloaded on first use. Specify the model name in .env:
WHISPER_MODEL=distil-medium.en

Option 1: Auto-download from HuggingFace
IMAGEGEN_MODEL=prompthero/openjourney

Option 2: Use local model

IMAGEGEN_MODEL=/path/to/your/local/model

Models are automatically downloaded on first use:
IMAGEEXPLAINER_MODEL=huihui-ai/Huihui-Qwen3-VL-2B-Instruct-abliterated

- Open http://localhost:8000 in your browser
- Click the microphone button to start voice input
- Speak naturally or type your message
- Attach images by clicking the image button
- The AI will respond with voice and/or images
See BUILD_INSTRUCTIONS.md for detailed build system information.
cd frontend
npm run dev # Starts dev server with hot reload at http://localhost:5173

python -m aiassistant.app # Runs backend server

- Privacy-conscious users wanting local AI assistants
- Anyone who wants a truly interactive AI companion
The ultimate vision is an animated 3D model of your choice that reacts and interacts during conversations, similar to a VTuber. Using technologies like Qwen ControlNet, image-editing pipelines, PyGame, Wan, and other open-source video/image generation models, it should be possible to create a fully animated avatar that:
- Lip-syncs to generated speech
- Shows expressions and reactions based on conversation context
- Responds with gestures and body language (essentially a more advanced version of Microsoft Clippy; yes, I am OLD)
- Real-time emotion detection and response
This is a long-term pipe dream, but with current AI models advancing rapidly, it should become feasible soon.
- Docker support
- Multi-character support (character profiles)
- Conversation memory and context management
- Voice cloning for personalized TTS
- Custom image generation styles and LoRA support
- Export/import conversation history
- Plugin system for custom engines
Contributions are welcome! Whether you want to:
- Add support for new LLM providers
- Implement new TTS/STT engines
- Integrate ComfyUI or other image generation backends
- Improve the UI/UX
- Fix bugs or add features
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Test thoroughly
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Claude for keeping up with my weird requests