An interactive image manipulation tool that combines Meta's Segment Anything Model (SAM) with Stable Diffusion for precise, AI-powered image editing. Click on any object in an image to segment it, then use natural language prompts to transform it into something new.
- Interactive Segmentation: Click on any object in an image to automatically segment it using SAM
- AI-Powered Inpainting: Use text prompts to generate new content in the masked region
- ControlNet Integration: Leverage semantic segmentation for better control over generation
- Background Mode: Option to inpaint backgrounds instead of foreground objects
- Real-time Preview: See masks and segmentation results instantly
- Web Interface: User-friendly Gradio interface accessible via browser
User Input Image → SAM Segmentation → Mask Generation → ControlNet + Stable Diffusion → Output Image
- Image Upload & Selection: User uploads an image and clicks on objects to edit
- Segmentation (SAM):
  - Converts point clicks into precise segmentation masks using a Vision Transformer (ViT-H) backbone
  - Generates an automatic semantic segmentation of the entire scene
- Mask Processing: Creates boolean masks and colored segmentation maps
- Inpainting (Stable Diffusion + ControlNet):
  - ControlNet conditions generation on the semantic segmentation map
  - Stable Diffusion inpaints the masked region based on the text prompt
  - The UniPC scheduler keeps generation fast (20 inference steps by default)
- Output: A seamlessly edited image with natural blending
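The mask-processing step can be sketched in plain NumPy: each boolean mask is painted with its own color to build the colored segmentation map that later conditions ControlNet. The fixed palette and toy masks below are illustrative stand-ins, not the exact values `app.py` uses.

```python
import numpy as np

def build_seg_map(masks):
    """Paint each boolean mask a distinct color to form an RGB segmentation map.

    masks: list of (H, W) boolean arrays (e.g. SAM's per-object masks).
    Returns an (H, W, 3) uint8 image; later masks overwrite earlier ones
    wherever they overlap.
    """
    palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255),
               (255, 255, 0), (255, 0, 255), (0, 255, 255)]
    h, w = masks[0].shape
    seg = np.zeros((h, w, 3), dtype=np.uint8)
    for i, m in enumerate(masks):
        seg[m] = palette[i % len(palette)]
    return seg

# Toy example: two rectangular "objects" on a 64x64 canvas
a = np.zeros((64, 64), dtype=bool); a[10:30, 10:30] = True
b = np.zeros((64, 64), dtype=bool); b[40:60, 40:60] = True
seg_map = build_seg_map([a, b])
print(seg_map.shape)  # (64, 64, 3)
```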
Segment Anything Model (SAM)
- Model: `sam_vit_h_4b8939.pth` (ViT-Huge backbone)
- Function: Point-based segmentation and automatic mask generation
- Output: Precise pixel-level masks and semantic segmentation maps
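For each point click, SAM's `SamPredictor.predict` (with `multimask_output=True`) returns several candidate masks together with predicted quality scores, and one of them is kept. The selection logic can be illustrated without loading the checkpoint; the arrays below are synthetic stand-ins for SAM's real output.

```python
import numpy as np

def pick_best_mask(masks, scores):
    """Return the candidate mask with the highest predicted quality score.

    masks: (N, H, W) boolean array of candidate masks for one click
    scores: (N,) float array of predicted mask-quality (IoU) scores
    """
    best = int(np.argmax(scores))
    return masks[best], float(scores[best])

# Stand-ins for SAM's output for one point click (3 candidates by default)
masks = np.zeros((3, 8, 8), dtype=bool)
masks[0, :2, :2] = True   # small, low-confidence candidate
masks[1, :4, :4] = True   # medium candidate
masks[2, :6, :6] = True   # large, high-confidence candidate
scores = np.array([0.55, 0.71, 0.93])

best_mask, best_score = pick_best_mask(masks, scores)
print(best_score)        # 0.93
print(best_mask.sum())   # 36 pixels selected
```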
Stable Diffusion Inpainting
- Model: `runwayml/stable-diffusion-inpainting`
- Function: Generate new content in masked regions
- Features: Text-guided, context-aware inpainting
ControlNet
- Model: `lllyasviel/sd-controlnet-seg`
- Function: Semantic control for stable generation
- Benefit: Maintains structure and coherence in edits
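To act as a ControlNet conditioning input, the segmentation map must be a PIL image at the generation resolution. A minimal preparation step might look like the sketch below; the 512x512 target mirrors the resize in `app.py`, and `prepare_control_image` is an illustrative helper, not a function from this repo.

```python
import numpy as np
from PIL import Image

def prepare_control_image(seg_array, size=(512, 512)):
    """Convert an (H, W, 3) uint8 segmentation map into a PIL control image.

    Nearest-neighbor resampling keeps region colors crisp instead of
    interpolating new colors along segment boundaries.
    """
    img = Image.fromarray(seg_array)
    return img.resize(size, resample=Image.NEAREST)

# Toy 64x64 two-region map upscaled to the generation resolution
seg = np.zeros((64, 64, 3), dtype=np.uint8)
seg[:, 32:] = (255, 0, 0)  # right half is one "segment"
control = prepare_control_image(seg)
print(control.size)  # (512, 512)
```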
```bash
git clone https://github.com/bhavyashah10/Image-Manipulation-SAM.git
cd Image-Manipulation-SAM
```

Option A: Using Conda (Recommended)

```bash
conda create -n sam-sd python=3.10
conda activate sam-sd
```

Option B: Using venv

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Use the requirements file:

```bash
pip install -r requirements.txt
```

or install manually:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install gradio numpy pillow diffusers transformers accelerate segment-anything
```

Download the SAM checkpoint (use a `resolve` URL rather than `blob` so wget fetches the file itself, not an HTML page):

```bash
wget https://huggingface.co/spaces/abhishek/StableSAM/resolve/main/sam_vit_h_4b8939.pth
```

Or manually download from here and place it in the project root.
Before running, fix the typo in app.py line 13:
```python
# Change this:
device = "cp `u"

# To this:
device = "cuda"  # or "cpu" if no GPU available
```

Gradio Web Interface (Recommended)

```bash
python app.py
```

The interface will launch at http://localhost:7860. For remote access, a public URL will be generated.
Jupyter Notebook
```bash
jupyter notebook Diffusion_with_sam.ipynb
```

Object Replacement
1. Upload: car.jpeg
2. Click: on the car
3. Prompt: "a yellow taxi cab"
4. Result: Car transformed into a taxi
Style Transfer
1. Upload: any image
2. Select: Background checkbox
3. Prompt: "sunset beach, golden hour"
4. Result: Background changed to beach scene
Object Modification
1. Upload: girl.png
2. Click: on clothing
3. Prompt: "wearing a red dress"
4. Result: Outfit changed to red dress
```
Image-Manipulation-SAM/
├── app.py                      # Main Gradio application
├── controlnet_inpaint.py       # Custom ControlNet inpainting pipeline
├── Diffusion_with_sam.ipynb    # Jupyter notebook demo
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
└── Test Images/                # Sample test images
```
Modify inference steps for the quality vs. speed tradeoff:

```python
# In app.py, inpaint function
output = pipe(
    prompt,
    image,
    mask,
    seg_img,
    negative_prompt=negative_prompt,
    num_inference_steps=20,  # increase (20-50) for better quality
)
```

Adjust the processing resolution:

```python
# In app.py, inpaint function
image = image.resize((512, 512))  # change to (768, 768) for higher resolution
```

1. CUDA Out of Memory
```python
# Solution 1: Use CPU
device = "cpu"

# Solution 2: Reduce image size
image = image.resize((384, 384))  # smaller resolution

# Solution 3: Clear GPU cache
torch.cuda.empty_cache()
```

2. SAM Model Not Found

```bash
# Verify the file exists
ls -lh sam_vit_h_4b8939.pth

# Re-download if needed
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```

3. Slow Generation
- Ensure the GPU is being used: check that `device = "cuda"` is set in app.py
- Close other GPU-intensive applications
- Reduce `num_inference_steps` to 15-20
4. Poor Quality Results
- Increase `num_inference_steps` to 30-50
- Improve prompt specificity
- Use detailed negative prompts
- Try re-selecting the object with better clicks
5. Import Errors
```bash
# Reinstall dependencies
pip install --upgrade diffusers transformers accelerate
```

Recommended:
- Run it on Google Colab

Optimal (for local use):
- GPU: 12GB+ VRAM (for 768x768 images)
- RAM: 32GB system memory