- Introduction
- What is Chroma?
- Installation & Setup
- Core Concepts
- CareMind Architecture
- Step-by-Step Guide
- Common Operations
- Troubleshooting
- Advanced Topics
- Best Practices
This tutorial covers Chroma, the vector database system used in the CareMind clinical decision support application. Chroma enables fast semantic search over medical guidelines through dense vector embeddings.
What you'll learn:
- How to install and configure Chroma
- How to create and manage vector collections
- How to embed documents and perform semantic search
- How to integrate Chroma with Streamlit applications
- Troubleshooting and optimization strategies
Chroma is a lightweight, open-source vector database designed for storing embeddings and performing semantic search. It provides:
- Persistent storage of document embeddings and metadata
- Fast similarity search using vector distances
- Flexible metadata filtering for structured queries
- Easy integration with Python applications
- Multi-collection support for organizing different datasets
In the CareMind project, Chroma serves as the Retrieval-Augmented Generation (RAG) backend:
- Medical guidelines are parsed into chunks
- Each chunk is embedded using a Chinese language model (e.g., BAAI/bge-large-zh-v1.5)
- Embeddings are stored in a persistent Chroma collection
- When a doctor asks a clinical question, the query is embedded and semantically matched against stored guidelines
- Retrieved evidence snippets feed into an LLM for final clinical advice generation
Chroma provides:
- Efficient vector storage and indexing
- Similarity-based retrieval in milliseconds
- Metadata preservation (source, section, year, etc.)
- Seamless Streamlit integration via caching
- Python 3.8+ (3.10 recommended for CareMind)
- pip or conda package manager
- SQLite (included in Python, but may need shimming on some cloud platforms)
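The SQLite shim mentioned above is typically a two-line module swap performed before `chromadb` is first imported. A sketch, assuming the commonly used `pysqlite3-binary` package as the drop-in replacement (Chroma needs a relatively recent SQLite, which some cloud images do not ship):

```python
# Sketch: swap in a newer SQLite before chromadb is imported.
# Assumes `pip install pysqlite3-binary`; if pysqlite3 is absent,
# fall back silently to the stdlib sqlite3.
import sys

try:
    __import__("pysqlite3")
    sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
except ImportError:
    pass  # stdlib sqlite3 is used as-is

import sqlite3
print("SQLite version:", sqlite3.sqlite_version)
```

Because the swap rebinds `sys.modules["sqlite3"]`, it must run before any other module (including `chromadb`) imports `sqlite3`.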
```bash
# Using pip
pip install chromadb

# Or with conda
conda install -c conda-forge chromadb
```

Check the installation:

```bash
python -c "import chromadb; print(chromadb.__version__)"
```

For the full CareMind setup, install the project requirements:

```bash
cd caremind-streamlit-dec
pip install -r requirements.txt
```

This installs:

- `chromadb` — vector database
- `sentence-transformers` — embedding models
- `torch` — deep learning backend (CPU or CUDA)
- `streamlit` — web UI framework
- Other dependencies (pandas, numpy, pydantic, etc.)
Create a `.env` file in the project root:

```bash
cp .env.example .env
```

Edit `.env` with your settings:

```bash
# Chroma persistence directory (will be created if it doesn't exist)
CHROMA_PERSIST_DIR=./chroma_store

# Collection name inside Chroma
CHROMA_COLLECTION=guideline_chunks_1024_v2

# Embedding model (Chinese-capable SentenceTransformer)
# Options: BAAI/bge-large-zh-v1.5, BAAI/bge-small-zh, sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5

# Optional: disable telemetry
CHROMA_TELEMETRY_OFF=1

# Optional: drug database path
DRUG_DB_PATH=./db/drugs.sqlite
```

Load the environment variables in your shell:

```bash
export $(grep -v '^#' .env | xargs)
```

Or use python-dotenv in your scripts (CareMind does this automatically).
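What that loading amounts to can be sketched with the standard library alone; python-dotenv's `load_dotenv()` does the same in one call, plus quoting and interpolation rules (`load_env_file` here is an illustrative helper, not part of CareMind):

```python
# Minimal .env loader sketch: skip blanks/comments, split on the
# first '=', and only set variables that are not already defined.
import os

def load_env_file(path=".env"):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Usage (assuming a .env like the one above exists):
# load_env_file()
# print(os.getenv("CHROMA_PERSIST_DIR"))
```

The `setdefault` call means real environment variables take precedence over the file, which matches the usual dotenv convention.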
```bash
# Test Chroma
python -c "
from chromadb import PersistentClient
client = PersistentClient(path='./chroma_store')
print('✓ Chroma client created successfully')
"

# Test embedding model
python -c "
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-zh')
embeddings = model.encode(['hello', 'world'])
print(f'✓ Model loaded, embedding dim: {embeddings.shape[1]}')
"
```

A collection is a named container within Chroma that holds documents, embeddings, and metadata.
Example:

```python
from chromadb import PersistentClient

client = PersistentClient(path='./chroma_store')

# Create or get a collection
collection = client.get_or_create_collection(
    name="guideline_chunks",
    metadata={"description": "Medical guidelines split into 1024-token chunks"}
)
```

When you add a document to a collection:
- Text is embedded into a vector (list of floats)
- Vector and metadata are stored in the collection
- Document ID is indexed for fast retrieval
Structure:

```python
# Adding documents
collection.add(
    ids=["doc_1", "doc_2"],
    documents=["Clinical guideline text 1...", "Clinical guideline text 2..."],
    metadatas=[
        {"source": "hypertension_guide.pdf", "section": "2.3", "year": 2024},
        {"source": "diabetes_guide.pdf", "section": "1.1", "year": 2023}
    ]
)
```

An embedding function converts text into dense vectors. CareMind uses SentenceTransformers models, which are pre-trained on massive text corpora.
Chinese Models (recommended for CareMind):
- `BAAI/bge-large-zh-v1.5` — 1024-dim, high quality (requires ~2GB VRAM)
- `BAAI/bge-small-zh` — 384-dim, lightweight (fits on 8GB VRAM)
- `intfloat/e5-large-v2` — 1024-dim, multilingual
How embeddings work:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# Embed a single query
query = "高血压患者的降压目标是多少?"
query_embedding = model.encode(query)
print(f"Query embedding shape: {query_embedding.shape}")  # (1024,)

# Embed a batch of documents
docs = ["文本1...", "文本2...", "文本3..."]
doc_embeddings = model.encode(docs)
print(f"Doc embeddings shape: {doc_embeddings.shape}")  # (3, 1024)
```

Chroma computes cosine similarity between query and document vectors to find the most relevant results.
Algorithm:

1. Embed the query: `query_vec = embed(query)`
2. For each document, compute `similarity = cosine(query_vec, doc_vec)`
3. Return the top-K documents sorted by similarity
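The three steps above can be written out directly. A small pure-Python sketch for intuition; in production, Chroma's indexed search replaces this linear scan:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.0], docs, k=2))  # doc 0 is an exact match, doc 2 is close
```

Real document vectors have hundreds or thousands of dimensions, but the scoring logic is identical.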
Example:

```python
results = collection.query(
    query_texts=["高血压治疗"],
    n_results=5
)

# results = {
#     "ids": [["id_1", "id_2", "id_3", "id_4", "id_5"]],
#     "documents": [["doc_text_1", "doc_text_2", ...]],
#     "distances": [[0.1, 0.2, 0.3, 0.4, 0.5]],
#     "metadatas": [[{...}, {...}, ...]]
# }
```

Chroma stores metadata alongside embeddings, enabling filtered search.
Supported metadata value types:

- Strings: `"source": "guidelines.pdf"`
- Numbers: `"year": 2024`, `"section_id": 3`
- Booleans: `"is_critical": true`
- Lists and other complex types are not supported directly; serialize them to strings first (e.g., `"keywords": "hypertension,cardiovascular"`)
Filtering example:

```python
# Query with metadata filter
results = collection.query(
    query_texts=["降压策略"],
    n_results=5,
    where={"year": {"$gte": 2020}}  # Only recent guidelines
)
```

Chroma saves collections to disk in a directory (e.g., `./chroma_store`), so data persists across application restarts.
Directory structure:

```
chroma_store/
├── chroma-*.db           # SQLite database files
├── <collection_uuid>/    # Collection-specific directory
│   └── ...               # Embedding indexes and metadata
```
```
┌────────────────────┐
│  User Query (UI)   │
└─────────┬──────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 1. Embed Query (SentenceTransformer)   │
│    (EMBEDDING_MODEL environment var)   │
└─────────┬──────────────────────────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 2. Chroma Vector Search                │
│    (CHROMA_PERSIST_DIR)                │
│    (CHROMA_COLLECTION)                 │
└─────────┬──────────────────────────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 3. Retrieved Evidence Chunks           │
│    + metadata (source, section, etc.)  │
└─────────┬──────────────────────────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 4. LLM Orchestrator (pipeline.py)      │
│    Integrates evidence into prompt     │
└─────────┬──────────────────────────────┘
          │
          ▼
┌────────────────────────────────────────┐
│ 5. Clinical Advice (with citations)    │
└────────────────────────────────────────┘
```
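The retrieval half of the pipeline (steps 1–3) can be sketched end to end. Everything here is illustrative: a bag-of-words counter stands in for the SentenceTransformer, a linear cosine scan stands in for Chroma's index, and the two-document corpus is invented:

```python
# Toy retrieval sketch: embed query -> score corpus -> return top-K.
import math
from collections import Counter

def embed(text):
    """Bag-of-words vector as a Counter (toy stand-in for a dense encoder)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    {"text": "target blood pressure for hypertension is below 140/90",
     "meta": {"source": "htn_guide.pdf"}},
    {"text": "metformin is first line therapy for type 2 diabetes",
     "meta": {"source": "dm_guide.pdf"}},
]

def search(query, k=1):
    qv = embed(query)
    scored = [(cosine(qv, embed(d["text"])), d) for d in corpus]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

hits = search("hypertension blood pressure target", k=1)
print(hits[0][1]["meta"]["source"])  # htn_guide.pdf
```

Step 4 then formats the returned chunks and their metadata into the LLM prompt; the retrieval contract (ranked text plus source metadata) is the same.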
| File | Purpose |
|---|---|
| `rag/retriever.py` | Chroma client/collection management, search API |
| `ingest/build_vectors.py` | Parse JSONL → embed chunks → write to Chroma |
| `rag/pipeline.py` | Orchestrate retriever + LLM for final advice |
| `tools/check_chroma.py` | Diagnostic tool to inspect Chroma health |
| `app.py` | Streamlit UI that calls `retriever.search_guidelines()` |
Convert raw PDF/text guidelines into structured JSONL:

```bash
python ingest/parse_docs.py \
    --in data/guidelines/ \
    --out data/guidelines.parsed.jsonl \
    --chunk-size 1024
```

Input: Directory of PDF or text files

Output: JSONL file with structure:

```
{"content": "chunk text here", "meta": {"source": "...", "year": 2024, ...}}
{"content": "chunk text here", "meta": {"source": "...", "year": 2024, ...}}
```

Verify the output:

```bash
# Check first few lines
head data/guidelines.parsed.jsonl

# Count total chunks
wc -l data/guidelines.parsed.jsonl
```

Set the build configuration:

```bash
export CHROMA_PERSIST_DIR=./chroma_store
export CHROMA_COLLECTION=guideline_chunks_1024_v2
export EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5
export EMBED_BATCH_SIZE=16
```

Then build the vector store:

```bash
# For GPU (fast)
python ingest/build_vectors.py

# For CPU (slow but works)
export DEVICE=cpu
python ingest/build_vectors.py
```

What this script does:
- Reads JSONL file line-by-line
- Embeds document chunks in batches
- Sanitizes metadata (handles complex types)
- Writes embeddings to Chroma with metadata
- Handles OOM gracefully (batch backoff)
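The steps above reduce to a read–batch–embed–upsert loop. A stdlib-only sketch with the embedding and storage calls stubbed out (`embed_fn` and `upsert_fn` are stand-ins for the real `model.encode` and `collection.upsert`; `ingest` is an illustrative helper, not the actual script):

```python
# Sketch of the build_vectors.py loop: stream JSONL, sanitize metadata,
# and upsert in fixed-size batches.
import io
import json

def ingest(jsonl_stream, embed_fn, upsert_fn, batch_size=2):
    batch = []
    total = 0
    for line_no, line in enumerate(jsonl_stream):
        rec = json.loads(line)
        batch.append({
            "id": f"doc_{line_no}",
            "text": rec["content"],
            # crude sanitization: force metadata values to strings
            "meta": {k: str(v) for k, v in rec.get("meta", {}).items()},
        })
        if len(batch) >= batch_size:
            upsert_fn(batch, embed_fn([d["text"] for d in batch]))
            total += len(batch)
            batch = []
    if batch:  # flush the remainder
        upsert_fn(batch, embed_fn([d["text"] for d in batch]))
        total += len(batch)
    return total

# Toy run on three in-memory records
data = "\n".join(json.dumps({"content": f"chunk {i}", "meta": {"year": 2024}})
                 for i in range(3))
stored = []
n = ingest(io.StringIO(data),
           embed_fn=lambda texts: [[0.0] for _ in texts],
           upsert_fn=lambda docs, vecs: stored.extend(docs))
print(n, len(stored))  # prints: 3 3
```

Streaming the JSONL line by line and flushing in batches is what keeps memory flat regardless of corpus size; the real script adds the OOM backoff around the embedding call.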
Progress output:

```
📦 Persist dir: /home/user/caremind-streamlit-dec/chroma_store
🗃️ Collection: guideline_chunks_1024_v2
🔧 Embedding model: BAAI/bge-large-zh-v1.5
⚙️ Device: cuda
📚 Reading JSONL: 100%|████████| 5234/5234 [12:34<00:00, 6.94 docs/s]
✅ Upserted 5234 documents
```
```bash
python tools/check_chroma.py --q "高血压 降压目标"
```

Expected output:

```
──────── Health Check ────────
Chroma dir:     ./chroma_store
Collection:     guideline_chunks_1024_v2
Embedding mdl:  BAAI/bge-large-zh-v1.5
Chroma version: 0.4.24

✔️ Collections found:
   - guideline_chunks_1024_v2 (5234 documents)

✔️ Query hit. Top-1 preview:
   高血压患者的降压目标应该根据不同的临床情况确定……
   Elapsed: 0.45s
```
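Note that `collection.query` reports *distances* while retriever hits expose a *score*. Assuming cosine distance (so `score = 1 - distance`; this convention is an assumption about CareMind's internals, not confirmed by the source), flattening a raw Chroma result into hit dicts looks like this (`to_hits` and the toy `raw` result are illustrative):

```python
# Sketch: flatten Chroma's nested query result into per-hit dicts.
# Assumes cosine distance, where similarity score = 1 - distance.
def to_hits(results):
    hits = []
    for doc, dist, meta in zip(results["documents"][0],
                               results["distances"][0],
                               results["metadatas"][0]):
        hits.append({"text": doc, "score": 1.0 - dist, **meta})
    return hits

# Toy result in the shape collection.query returns
raw = {
    "documents": [["降压目标 <140/90 mmHg", "二甲双胍为一线用药"]],
    "distances": [[0.12, 0.55]],
    "metadatas": [[{"source": "htn.pdf"}, {"source": "dm.pdf"}]],
}
for h in to_hits(raw):
    print(f"{h['source']}: {h['score']:.2f}")
```

The outer lists are per-query (one entry per element of `query_texts`), which is why every field is indexed with `[0]` for a single query.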
```python
from rag import retriever as R

# Search for relevant guidelines
query = "高血压患者合并糖尿病应该如何管理血压?"
hits = R.search_guidelines(query, k=5)

for i, hit in enumerate(hits, 1):
    print(f"{i}. Score: {hit['score']:.3f}")
    print(f"   Source: {hit.get('source', 'unknown')}")
    print(f"   Text: {hit['text'][:200]}...\n")
```

The Streamlit UI automatically uses Chroma via `rag.retriever`:
```python
import streamlit as st
from rag import retriever as R

# User input
query = st.text_area("Clinical Question")

if st.button("Search Guidelines"):
    # Retrieve evidence
    hits = R.search_guidelines(query, k=5)

    # Display results
    for hit in hits:
        with st.expander(f"📄 {hit.get('source', 'Unknown')}"):
            st.write(hit['text'])
            st.write(f"**Similarity Score:** {hit['score']:.3f}")
```

To deploy on Streamlit Cloud:

- Push the repository to GitHub
- Connect it to Streamlit Cloud
- Set environment variables in the Streamlit Cloud dashboard:

  ```
  CHROMA_PERSIST_DIR=./chroma_store
  CHROMA_COLLECTION=guideline_chunks_1024_v2
  EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5
  ```

- Ensure `chroma_store/` is committed to Git (or use a separate branch)
```bash
# Start Streamlit app
streamlit run app.py \
    --server.port 8501 \
    --logger.level=info

# Or run under systemd/supervisor for auto-restart
```

List all collections:

```python
from chromadb import PersistentClient

client = PersistentClient(path='./chroma_store')
collections = client.list_collections()
for col in collections:
    print(f"Collection: {col.name}")
    print(f"  Count: {col.count()}")
    print(f"  Metadata: {col.metadata}")
```

Get a handle to the CareMind collection:

```python
collection = client.get_collection("guideline_chunks_1024_v2")
```
Query the collection:

```python
# Simple query
results = collection.query(
    query_texts=["Hypertension treatment"],
    n_results=10
)

# Query with metadata filter
results = collection.query(
    query_texts=["Hypertension treatment"],
    n_results=10,
    where={"year": {"$gte": 2020}}
)

# Query with multiple conditions
results = collection.query(
    query_texts=["Hypertension treatment"],
    n_results=10,
    where={
        "$and": [
            {"year": {"$gte": 2020}},
            {"source": {"$in": ["hypertension_guide.pdf", "cardio_guide.pdf"]}}
        ]
    }
)
```

Add or update documents:

```python
collection.add(
    ids=["doc_1", "doc_2", "doc_3"],
    documents=[
        "Clinical guideline text 1...",
        "Clinical guideline text 2...",
        "Clinical guideline text 3..."
    ],
    metadatas=[
        {"source": "guide.pdf", "section": "2.1", "year": 2024},
        {"source": "guide.pdf", "section": "2.2", "year": 2024},
        {"source": "guide.pdf", "section": "2.3", "year": 2024}
    ]
)

# Upsert (update if exists, insert if new)
collection.upsert(
    ids=["doc_1"],
    documents=["Updated text..."],
    metadatas=[{"source": "guide.pdf", "updated": True}]
)
```

Delete documents:

```python
# Delete by ID
collection.delete(ids=["doc_1", "doc_2"])

# Delete by metadata filter
collection.delete(where={"source": "old_guide.pdf"})
```

Retrieve a document by ID:

```python
result = collection.get(ids=["doc_1"])
print(result['documents'][0])
print(result['metadatas'][0])
```

Delete an entire collection:

```python
client.delete_collection("old_collection")
```

Issue: `ModuleNotFoundError: No module named 'chromadb'`

Cause: Chroma not installed
Solution:

```bash
pip install chromadb
```

Issue: "already exists with different settings" error

Cause: Multiple Chroma clients created with different settings

Solution: In CareMind, use the singleton pattern via `retriever.get_chroma_client()`:

```python
from rag import retriever as R

# Uses cached, single client
client = R.get_chroma_client()
collection = R.get_chroma_collection()
```

Issue: Collection not found or empty

Cause: Collection name mismatch or incorrect persist directory

Solution:

```bash
# Check what collections exist
python tools/check_chroma.py

# Verify environment variables
echo $CHROMA_PERSIST_DIR
echo $CHROMA_COLLECTION
```

Issue: Slow queries

Causes & Solutions:
- Large embedding dimension: use a smaller model

  ```bash
  export EMBEDDING_MODEL=BAAI/bge-small-zh  # 384-dim instead of 1024
  ```

- Too many documents: rebuild with a smaller chunk size or filter by metadata
- CPU embedding: use the GPU

  ```bash
  export DEVICE=cuda
  ```

Issue: Out-of-memory (OOM) during ingestion

Cause: Batch size too large for the GPU

Solution: Lower the batch size:

```bash
export EMBED_BATCH_SIZE=8  # default 16
python ingest/build_vectors.py
```

Or use the CPU fallback:

```bash
export OOM_CPU_FALLBACK=1
python ingest/build_vectors.py
```

Issue: Embedding dimension mismatch

Cause: Changed embedding model after building the collection

Example: Built with BAAI/bge-large-zh-v1.5 (1024-dim) but querying with BAAI/bge-small-zh (384-dim)

Solution: Rebuild the entire collection with a consistent model:

```bash
export EMBEDDING_MODEL=BAAI/bge-large-zh-v1.5
python ingest/create_db.py --in data/guidelines.parsed.jsonl
```

Issue: `database is locked` (SQLite)

Cause: Concurrent access to the SQLite database

Solution: Use `pysqlite3-binary` in requirements.txt for Streamlit Cloud:

```bash
pip install pysqlite3-binary
```

Create a custom embedding function if needed:
```python
from chromadb.api.types import EmbeddingFunction

class CustomEmbedder(EmbeddingFunction):
    def __init__(self, model_name):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def __call__(self, input):
        # input is a list of strings
        return self.model.encode(input).tolist()

# Use in a collection
embedder = CustomEmbedder("BAAI/bge-large-zh-v1.5")
collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=embedder
)
```

More complex filters:

```python
# OR condition
results = collection.query(
    query_texts=["hypertension"],
    where={
        "$or": [
            {"source": "guide1.pdf"},
            {"source": "guide2.pdf"}
        ]
    }
)

# Exclusion (Chroma's where filters have no $not operator; use $ne or $nin)
results = collection.query(
    query_texts=["hypertension"],
    where={"source": {"$ne": "outdated_guide.pdf"}}
)

# Substring match (metadata filters do not support regular expressions;
# where_document matches against the document text instead)
results = collection.query(
    query_texts=["hypertension"],
    where_document={"$contains": "guide"}
)
```

Batch upserts keep ingestion memory-bounded:

```python
from tqdm import tqdm

docs = load_all_documents()  # list of {id, text, meta}
batch_size = 100

for i in tqdm(range(0, len(docs), batch_size)):
    batch = docs[i:i+batch_size]
    collection.upsert(
        ids=[d["id"] for d in batch],
        documents=[d["text"] for d in batch],
        metadatas=[d["meta"] for d in batch]
    )
```

After Chroma retrieves the top-K candidates, rerank them with a more powerful model:
```python
from sentence_transformers import CrossEncoder

# Initial retrieval (fast)
results = collection.query(query_texts=[query], n_results=50)

# Rerank (slower but more accurate)
reranker = CrossEncoder('BAAI/bge-reranker-large')
documents = results['documents'][0]
scores = reranker.predict([[query, doc] for doc in documents])

# Sort by reranker scores
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
top_5 = ranked[:5]
```

Monitor query latency and result quality:

```python
import time

start = time.time()
results = collection.query(query_texts=[query], n_results=5)
latency_ms = (time.time() - start) * 1000

print(f"Query latency: {latency_ms:.1f}ms")
print(f"Results returned: {len(results['documents'][0])}")
print(f"Average distance: {sum(results['distances'][0]) / len(results['distances'][0]):.3f}")
```

Always externalise configuration:
```python
import os

persist_dir = os.getenv("CHROMA_PERSIST_DIR", "./chroma_store")
collection_name = os.getenv("CHROMA_COLLECTION", "guideline_chunks")
embedding_model = os.getenv("EMBEDDING_MODEL", "BAAI/bge-large-zh-v1.5")
```

In Streamlit, use caching to avoid recreating Chroma clients:
```python
import streamlit as st
from chromadb import PersistentClient

@st.cache_resource
def get_chroma_client():
    return PersistentClient(path="./chroma_store")

@st.cache_resource
def get_collection():
    client = get_chroma_client()
    return client.get_collection("guideline_chunks")
```

Ensure metadata is scalar or JSON-serializable:
```python
def sanitize_metadata(meta):
    """Convert complex types to strings for Chroma."""
    sanitized = {}
    for key, value in meta.items():
        if isinstance(value, (str, int, float, bool)):
            sanitized[key] = value
        elif isinstance(value, list):
            # Chroma metadata values must be scalars, so join lists
            sanitized[key] = ",".join(str(v) for v in value)
        else:
            sanitized[key] = str(value)  # fallback
    return sanitized
```

Log retrieval activity:

```python
import logging

logger = logging.getLogger(__name__)

def search_with_logging(query, k=5):
    logger.info(f"Query: {query} | k={k}")
    results = collection.query(query_texts=[query], n_results=k)
    logger.info(f"Results: {len(results['documents'][0])} docs retrieved")
    logger.debug(f"Scores: {results['distances'][0]}")
    return results
```

Degrade gracefully when nothing is found:

```python
results = collection.query(query_texts=[query], n_results=5)

if not results['documents'] or not results['documents'][0]:
    st.warning("No relevant guidelines found. Please refine your search.")
else:
    for i, (doc, score, meta) in enumerate(zip(
        results['documents'][0],
        results['distances'][0],
        results['metadatas'][0]
    )):
        st.write(f"**Result {i+1}** (Score: {score:.3f})")
        st.write(doc)
        st.caption(f"Source: {meta.get('source', 'unknown')}")
```

Track collection configurations in your code:
```python
COLLECTION_VERSION = "guideline_chunks_1024_v2"  # Include embedding dim and version
EMBEDDING_MODEL = "BAAI/bge-large-zh-v1.5"       # Lock to a specific model

# Document why this version exists
"""
Changelog:
- v1: Initial collection (2024-01-15) | BAAI/bge-small-zh | 384-dim | 3000 docs
- v2: Added 2024 guidelines (2024-03-20) | BAAI/bge-large-zh-v1.5 | 1024-dim | 5234 docs
"""
```

Add a health check:

```python
def health_check(collection):
    """Verify collection is healthy."""
    try:
        # Can we query?
        test_query = "test"
        results = collection.query(query_texts=[test_query], n_results=1)

        # Count documents
        count = collection.count()

        return {
            "status": "healthy",
            "document_count": count,
            "can_query": True
        }
    except Exception as e:
        return {
            "status": "unhealthy",
            "error": str(e)
        }

health = health_check(collection)
print(health)
```

Back up the store regularly:

```bash
# Backup Chroma directory
cp -r chroma_store chroma_store.backup.$(date +%Y%m%d)

# Or sync to cloud storage
aws s3 sync chroma_store s3://my-bucket/chroma-backups/
```

Chroma is a powerful, lightweight vector database that enables semantic search in CareMind. Key takeaways:
✅ Install & configure via environment variables
✅ Embed documents using Chinese-capable language models
✅ Query collections with semantic similarity search
✅ Integrate with Streamlit using caching and RAG patterns
✅ Monitor health with diagnostic tools
✅ Scale responsibly by optimizing batch sizes and models
For more information, see:
Last Updated: December 2024
Authors: CareMind Development Team