EmbeddingGenerator.set_text_model does not update the model; method remains "fastembed" #160
Hi! We noticed that when using `TextEmbedder` directly, everything works as expected:

```python
from semantica.embeddings import TextEmbedder

generator = TextEmbedder(method="sentence_transformers", model_name="dangvantuan/sentence-camembert-base")
print(f"Current method: {generator.get_method()}")

# Prepare texts for embedding generation
print("Preparing texts for embedding...")
texts = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

try:
    print(f"methods info : {generator.get_model_info()}")
    # Generate embeddings using the configured embedding generator
    print("Generating embeddings...")
    embeddings = generator.embed_batch(texts, show_progress_bar=True)
    print("Embeddings generated successfully:")
    print(f" - Total embeddings: {len(embeddings)}")
    print(f" - Embedding dimension: {embeddings.shape[1] if len(embeddings) > 0 else 0}")
    print(embeddings)
except ImportError:
    print("Error occurred")
```

Output:

Current method: sentence_transformers
Preparing texts for embedding...
methods info : {'method': 'sentence_transformers', 'model_name': 'dangvantuan/sentence-camembert-base', 'model_loaded': True, 'dimension': 768, 'normalize': True, 'device': 'cpu'}
Generating embeddings...
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.46it/s]
Embeddings generated successfully:
- Total embeddings: 5
 - Embedding dimension: 768

However, when going through the `Semantica` core with a config dict, the configured model is not applied:

```python
config_dict = {
    "project_name": "Semantica_Test_Project",
    # Embedding configuration
    "embedding": {
        "provider": "sentence_transformers",
        "model": "dangvantuan/sentence-camembert-base",  # 768-dimensional embeddings
    },
    # Extraction configuration (for NER and relation extraction)
    "extraction": {
        "provider": "groq",
        "model": "llama-3.1-8b-instant",
        "temperature": 0.0,  # Deterministic extraction
    },
    # Inference configuration (for answer generation)
    "inference": {
        "provider": "groq",
        "model": "llama-3.3-70b-versatile",
    },
    # Vector store configuration
    "vector_store": {
        "provider": "faiss",
        "dimension": 768,  # Must match embedding dimension
    },
    # Knowledge graph configuration
    "knowledge_graph": {
        "backend": "networkx",
        "merge_entities": True,  # Automatically merge duplicate entities
    },
}

from semantica.core import Semantica, ConfigManager

# Load configuration and initialize Semantica core
config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)
print(f"Config directory : {config_dict}\n")

print("Preparing texts for embedding...")
texts = [
    "Un avion est en train de décoller.",
    "Un homme joue d'une grande flûte.",
    "Un homme étale du fromage râpé sur une pizza.",
    "Une personne jette un chat au plafond.",
    "Une personne est en train de plier un morceau de papier.",
]

core.embedding_generator.set_text_model(method="sentence_transformers", model_name="dangvantuan/sentence-camembert-base")
print(f" - Total chunks to embed: {len(texts)}")
print(f" - Text method : {core.embedding_generator.get_text_method()}")
print(f" - Methods info: {core.embedding_generator.get_methods_info()}")
print(f" - Expected dimension: 768")
print("\n")

embeddings = core.embedding_generator.generate_embeddings(texts, data_type="text")
print("Embeddings generated successfully:")
print(f" - Total embeddings: {len(embeddings)}")
print(f" - Embedding dimension: {embeddings.shape[1] if len(embeddings) > 0 else 0}")
print("\n")
print(embeddings)
```

Output:

Config directory : {'project_name': 'Semantica_Test_Project', 'embedding': {'provider': 'sentence_transformers', 'model': 'dangvantuan/sentence-camembert-base'}, 'extraction': {'provider': 'groq', 'model': 'llama-3.1-8b-instant', 'temperature': 0.0}, 'inference': {'provider': 'groq', 'model': 'llama-3.3-70b-versatile'}, 'vector_store': {'provider': 'faiss', 'dimension': 768}, 'knowledge_graph': {'backend': 'networkx', 'merge_entities': True}}
Preparing texts for embedding...
- Total chunks to embed: 5
- Text method : fastembed
- Methods info: {'text': {'method': 'fastembed', 'model_name': 'dangvantuan/sentence-camembert-base', 'model_loaded': True, 'dimension': 384, 'normalize': True}}
- Expected dimension: 768
Embeddings generated successfully:
- Total embeddings: 5
- Embedding dimension: 384
Even though the expected embedding dimension is 768, the generator still produces embeddings of dimension 384, and the method remains "fastembed" despite the call to `set_text_model`. We looked inside the code to try to understand the cause.
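For anyone hitting the same symptom, a small defensive check makes the mismatch fail fast instead of silently indexing 384-dimensional vectors into a 768-dimensional FAISS store. This is a plain-Python sketch; `check_dimension` is a hypothetical helper, not part of semantica:

```python
def check_dimension(embeddings, expected_dim):
    """Raise early if produced vectors do not match the configured dimension."""
    actual = len(embeddings[0])
    if actual != expected_dim:
        raise ValueError(
            f"Embedding dimension mismatch: got {actual}, expected {expected_dim}. "
            "The embedding backend may not have picked up the configured model."
        )
    return True

# Simulate what the buggy config path produced: 384-dim fastembed vectors
vectors = [[0.1] * 384 for _ in range(5)]
try:
    check_dimension(vectors, 768)
except ValueError as e:
    print(e)
```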
Replies: 1 comment 2 replies
Hi @MoktarEls! Thank you so much for the detailed bug report; your investigation was incredibly helpful in identifying the root cause!

**The Fix**

We've resolved the issue in the `embeddings` branch (associated with PR #160). The `TextEmbedder` now correctly resets its internal state and dynamically detects embedding dimensions when switching models.

**Example**

```python
# Switch to sentence_transformers
core.embedding_generator.set_text_model(
    method="sentence_transformers",
    model_name="dangvantuan/sentence-camembert-base"
)

# It now correctly reflects the change:
info = core.embedding_generator.get_methods_info()
print(f"Method: {info['text']['method']}")        # sentence_transformers
print(f"Dimension: {info['text']['dimension']}")  # 768
```

**How to Verify**

You can install the latest changes directly with pip:

```shell
pip install git+https://github.com/Hawksight-AI/semantica.git@main
```

Alternatively, you can clone the repository and switch to the branch:

```shell
git clone https://github.com/Hawksight-AI/semantica.git
git checkout embeddings
```

We've also included a new test suite covering this behavior. All these fixes will be officially released in the next version of `semantica`. Thanks again for helping us improve the project!
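For readers curious what "resets its internal state" means in practice, here is a minimal sketch of the pattern (the `TextEmbedderSketch` class and its model names are hypothetical illustrations, not the actual semantica code): switching models must replace every piece of cached state, including the loaded backend, so later queries reflect the new configuration rather than the one set at construction time.

```python
class TextEmbedderSketch:
    """Illustrative only: shows the state-reset pattern, not semantica's code."""

    def __init__(self, method, model_name, dimension):
        self._configure(method, model_name, dimension)

    def _configure(self, method, model_name, dimension):
        self.method = method
        self.model_name = model_name
        # In the real fix the dimension is detected dynamically from the
        # loaded model; here it is passed in to keep the sketch self-contained.
        self.dimension = dimension
        self._model = None  # drop any previously loaded backend

    def set_text_model(self, method, model_name, dimension):
        # The reported bug pattern: storing only the model_name while keeping
        # the old method and cached backend. The fix: reset ALL cached state.
        self._configure(method, model_name, dimension)

    def get_methods_info(self):
        return {"text": {"method": self.method,
                         "model_name": self.model_name,
                         "dimension": self.dimension}}

emb = TextEmbedderSketch("fastembed", "default-fastembed-model", 384)
emb.set_text_model("sentence_transformers", "dangvantuan/sentence-camembert-base", 768)
info = emb.get_methods_info()
print(info["text"]["method"])     # sentence_transformers
print(info["text"]["dimension"])  # 768
```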