How Ormah Works
Embeddings System
Content verified · 2026-04-13
The embeddings subsystem powers vector search, similarity detection, and cross-encoder reranking. It also provides the encoder used by prompt-intent classification. Most of the implementation lives in src/ormah/embeddings/.
Architecture
graph TD
subgraph "Public API"
ENC["get_encoder()<br/>encoder.py"]
end
subgraph "Adapters (pluggable)"
LOCAL["LocalAdapter<br/>FastEmbed / BGE<br/>CPU-only, ~420MB"]
OLLAMA["OllamaAdapter<br/>HTTP to localhost:11434"]
LITELLM["LiteLLMAdapter<br/>OpenAI, Gemini, Voyage, etc."]
end
subgraph "Storage"
VS["VectorStore<br/>sqlite-vec virtual table"]
end
subgraph "Search"
HS["HybridSearch<br/>FTS + Vector + RRF"]
end
subgraph "Precision"
RERANK["Reranker<br/>Cross-encoder MiniLM"]
end
ENC --> LOCAL
ENC --> OLLAMA
ENC --> LITELLM
LOCAL --> VS
OLLAMA --> VS
LITELLM --> VS
VS --> HS
HS --> RERANK
style ENC fill:#74b3a5,color:#000
Embedding Adapter Interface
Code: embeddings/base.py
from abc import ABC, abstractmethod

import numpy as np

class EmbeddingAdapter(ABC):
    @abstractmethod
    def encode(self, text: str) -> np.ndarray:
        """Single text → L2-normalized vector"""

    @abstractmethod
    def encode_batch(self, texts: list[str], batch_size: int = 32) -> np.ndarray:
        """Batch encoding → normalized vectors"""

    def encode_query(self, text: str) -> np.ndarray:
        """Query-specific encoding (may add prefix)"""
        return self.encode(text)  # default: no query prefix

    @property
    @abstractmethod
    def dim(self) -> int:
        """Vector dimensionality"""
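For a concrete sense of the contract, here is a toy adapter satisfying the interface above. ToyHashAdapter is hypothetical (not in the codebase); it returns deterministic hash-seeded unit vectors, which makes the L2-normalization requirement visible without loading a real model:

```python
import hashlib

import numpy as np


class ToyHashAdapter:
    """Hypothetical stand-in satisfying the EmbeddingAdapter contract:
    deterministic hash-seeded vectors, L2-normalized. Illustration only."""

    @property
    def dim(self) -> int:
        return 768

    def encode(self, text: str) -> np.ndarray:
        # Seed a PRNG from the text so equal inputs map to equal vectors.
        seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
        vec = np.random.default_rng(seed).standard_normal(self.dim)
        return vec / np.linalg.norm(vec)  # the contract: unit-length output

    def encode_batch(self, texts: list[str], batch_size: int = 32) -> np.ndarray:
        return np.stack([self.encode(t) for t in texts])

    def encode_query(self, text: str) -> np.ndarray:
        # No query prefix here; real adapters may prepend one (see BGE below).
        return self.encode(text)
```

Any real adapter plugs into VectorStore and HybridSearch the same way, because the callers only depend on this interface.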
Local Adapter (Default)
Code: embeddings/local_adapter.py
- Model: BAAI/bge-base-en-v1.5
- Dimensions: 768
- Library: FastEmbed (ONNX runtime, CPU-only)
- Size: ~420MB download on first use
- Cache: ~/.local/share/ormah/models/
# Query encoding adds special prefix automatically
model.query_embed("what database does ormah use?")
# → internally: "Represent this sentence for searching relevant passages: what database does ormah use?"
# Document encoding (no prefix)
model.embed("Chose SQLite over Postgres for local-first design")
Lazy loading: Model is downloaded and loaded on first encode() call. Subsequent calls use a module-level _model_cache singleton keyed by model name.
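The lazy-load-and-cache pattern can be sketched as follows; load_model here is a stand-in for FastEmbed's real download-and-load step, and the function names are illustrative:

```python
# Sketch of the lazy-loading described above; `load_model` stands in for
# the expensive download + ONNX session setup done by FastEmbed.
_model_cache: dict[str, object] = {}

def load_model(model_name: str) -> object:
    # Placeholder: the real version downloads ~420MB on first use.
    return f"<model {model_name}>"

def get_model(model_name: str = "BAAI/bge-base-en-v1.5") -> object:
    # First call pays the load cost; later calls hit the module-level cache.
    if model_name not in _model_cache:
        _model_cache[model_name] = load_model(model_name)
    return _model_cache[model_name]
```

Because the cache is keyed by model name, switching models in settings loads a second model rather than evicting the first.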
Why BGE?
- Runs entirely on CPU (no GPU needed)
- 768 dimensions gives good precision without excessive storage
- Asymmetric retrieval: different encoding for queries vs documents improves recall
- ~420MB is acceptable for a local-first tool
Ollama Adapter
Code: embeddings/ollama_adapter.py
For users running Ollama locally:
# HTTP POST to http://localhost:11434/api/embed
response = httpx.post(f"{base_url}/api/embed", json={
    "model": "nomic-embed-text",  # default
    "input": [text],
})
- Batch processing with 32-item chunks
- Auto-normalization of returned vectors
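The chunking and normalization steps can be sketched without the HTTP call; fetch_embeddings below is a hypothetical stand-in for the POST to /api/embed shown above:

```python
import numpy as np

def chunked(items: list[str], size: int = 32):
    # Yield successive batches, matching the adapter's 32-item chunk size.
    for i in range(0, len(items), size):
        yield items[i : i + size]

def normalize(vectors: np.ndarray) -> np.ndarray:
    # Embedding models are not guaranteed to return unit vectors, so
    # each row is L2-normalized before storage.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def encode_batch(texts, fetch_embeddings, batch_size=32) -> np.ndarray:
    # `fetch_embeddings` is a stand-in for the HTTP round trip.
    out = [normalize(np.asarray(fetch_embeddings(chunk), dtype=np.float32))
           for chunk in chunked(texts, batch_size)]
    return np.vstack(out)
```

The LiteLLM adapter below follows the same chunk-then-normalize shape, with the HTTP call swapped for litellm.embedding().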
LiteLLM Adapter
Code: embeddings/litellm_adapter.py
For cloud-hosted embeddings (OpenAI, Gemini, Voyage, Mistral):
# Uses litellm's unified interface
response = litellm.embedding(model="text-embedding-3-small", input=[text])
- Batch processing with 32-item chunks
- Auto-normalization
Encoder Factory
Code: embeddings/encoder.py, embeddings/__init__.py
def get_encoder(settings=None) -> EmbeddingAdapter:
    if settings is None:
        settings = default_settings
    return get_adapter(settings)

def get_adapter(settings) -> EmbeddingAdapter:
    if settings.embedding_provider == "local":
        return LocalAdapter(model_name=settings.embedding_model)
    if settings.embedding_provider == "ollama":
        return OllamaEmbeddingAdapter(
            model=settings.embedding_model,
            base_url=settings.llm_base_url,
            dim=settings.embedding_dim,
        )
    if settings.embedding_provider == "litellm":
        return LiteLLMEmbeddingAdapter(
            model=settings.embedding_model,
            dim=settings.embedding_dim,
        )
    # Unknown provider: fail loudly rather than fall through to None
    raise ValueError(f"Unknown embedding provider: {settings.embedding_provider!r}")
The current implementation caches one adapter per settings-object identity in a module-level _adapter_cache.
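A minimal sketch of identity-keyed caching (names illustrative; build stands in for the provider dispatch in get_adapter). A real implementation might prefer weakref.WeakKeyDictionary so cache entries die with their settings objects:

```python
# Sketch of per-identity adapter caching as described above.
_adapter_cache: dict[int, object] = {}

def get_adapter_cached(settings, build) -> object:
    # Same settings object -> same adapter instance; a new settings
    # object (even with identical values) builds a fresh adapter.
    key = id(settings)
    if key not in _adapter_cache:
        _adapter_cache[key] = build(settings)
    return _adapter_cache[key]
```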
Vector Store
Code: embeddings/vector_store.py
Wraps the sqlite-vec extension for vector storage and KNN search:
The backing store is the node_vectors sqlite-vec virtual table in the derived SQLite index.
class VectorStore:
    def upsert(self, node_id: str, embedding: np.ndarray):
        """DELETE + INSERT in a single transaction"""
        # Atomic: prevents a reader from seeing a missing row

    def search(self, query_vec: np.ndarray, k: int) -> list[dict]:
        """KNN via sqlite-vec MATCH operator"""
        # SELECT id, distance FROM node_vectors
        # WHERE embedding MATCH ? AND k = ?
        # Returns L2 distance, converted:
        # cosine_similarity = 1 - (distance² / 2)

    def upsert_batch(self, items: list[tuple[str, np.ndarray]]):
        """Batch upsert in a single transaction"""
L2 to Cosine Conversion
sqlite-vec uses L2 (Euclidean) distance internally. For L2-normalized vectors (unit vectors), cosine similarity can be derived:
cosine_similarity = 1 - (L2_distance² / 2)
This works because for unit vectors: ||a - b||² = 2 - 2·cos(a,b), so cos(a,b) = 1 - ||a-b||²/2.
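The identity is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(768), rng.standard_normal(768)
a /= np.linalg.norm(a)  # unit vectors, as every adapter stores them
b /= np.linalg.norm(b)

l2 = float(np.linalg.norm(a - b))  # what sqlite-vec returns
cosine = float(a @ b)              # ground truth
assert abs((1 - l2**2 / 2) - cosine) < 1e-9
```

Note the identity holds only because the adapters normalize every vector before storage; un-normalized vectors would break the conversion.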
Cross-Encoder Reranker
Code: embeddings/reranker.py
A precision filter used exclusively in the whisper pipeline. Unlike embeddings (which encode query and document separately), the cross-encoder sees both together.
How It Works
flowchart LR
subgraph "Bi-Encoder (fast, recall)"
Q1[Query] --> ENC1[Encoder]
D1[Document] --> ENC2[Encoder]
ENC1 --> SIM["Cosine similarity<br/>score: 0.72"]
ENC2 --> SIM
end
subgraph "Cross-Encoder (slow, precision)"
Q2[Query] --> CE["Cross-Encoder<br/>sees both together"]
D2[Document] --> CE
CE --> SCORE["Relevance score<br/>score: +2.3"]
end
Model: cross-encoder/ms-marco-MiniLM-L-6-v2
- Input: (query, document) pairs
- Output: raw score in [-12, +6] range
- Positive = relevant, negative = irrelevant
Linear Rescale
def _linear_rescale(ce_score: float) -> float:
    """Maps [-12, +6] → [0, 1] linearly"""
    return max(0.0, min(1.0, (ce_score - (-12)) / (6 - (-12))))
Linear rescale (not sigmoid) is used because it preserves the cross-encoder's ability to strongly suppress irrelevant results. Sigmoid would flatten the extremes, reducing discrimination.
Blending
final_score = alpha * linear_rescale(ce_score) + (1 - alpha) * embedding_score
# alpha = 0.6 (60% cross-encoder weight)
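Putting rescale and blend together for the example scores in the diagram above (a sketch; function names are illustrative):

```python
def linear_rescale(ce_score: float) -> float:
    # Same mapping as _linear_rescale above: [-12, +6] -> [0, 1].
    return max(0.0, min(1.0, (ce_score + 12) / 18))

def blend(ce_score: float, embedding_score: float, alpha: float = 0.6) -> float:
    # 60% cross-encoder relevance, 40% bi-encoder cosine similarity.
    return alpha * linear_rescale(ce_score) + (1 - alpha) * embedding_score

# Diagram example: CE score +2.3 with embedding similarity 0.72
blend(2.3, 0.72)  # ≈ 0.765
```

A strongly negative cross-encoder score (say, -10) rescales to ≈0.11 and drags the blend down even for a decent embedding match, which is exactly the suppression behavior the linear rescale preserves.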
Model Cache
Code: embeddings/cache.py
def get_fastembed_cache_dir() -> Path:
    """~/.local/share/ormah/models/ or FASTEMBED_CACHE_PATH env"""

def is_model_cached(model_name: str) -> bool:
    """Check if model directory exists"""
During ormah setup, both the embedding model and reranker model are preloaded so the first whisper call doesn't have a cold-start delay.
Content Truncation
Before embedding, content is truncated to embedding_max_content_chars (default: 512 characters). This prevents long documents from getting "averaged out" embeddings that match everything weakly.
text_to_embed = f"{title}\n{content[:512]}" if title else content[:512]
Walkthrough: Encoding and Searching
Say we store: "Chose SQLite over Postgres because local-first doesn't need a server database"
- Encoding: LocalAdapter.encode("Chose SQLite over Postgres...") → 768-dim normalized vector
- Storage: VectorStore.upsert(node_id, vector) → INSERT into sqlite-vec
- Later, search: the user asks "what database does ormah use?"
- Query encoding: LocalAdapter.encode_query("what database does ormah use?") → 768-dim query vector (with "Represent this sentence..." prefix)
- KNN search: VectorStore.search(query_vec, k=30) → returns L2 distances
- Convert: cosine_sim = 1 - (distance² / 2) → 0.82 for "Chose SQLite..."
- Threshold: 0.82 > 0.4 → passes
- Feed into HybridSearch: combined with FTS5 results via RRF (see 03 - Search and Ranking)