How it works.
saavos is a retrieval-augmented generation (RAG) system. Your content is chunked, embedded into vectors, and stored. At chat time, the visitor's question is embedded and matched against your vectors. The top matching chunks are sent to Claude with the question. Claude answers from your content, not from the internet.
System diagram.
Bot owner (dashboard)
│
├─ Add source (URL / PDF / text)
│   └─► Ingest worker
│        ├─ Fetch content (10s timeout, follow 5 redirects)
│        ├─ Extract readable text (Mozilla Readability)
│        ├─ Clean (strip nav/footer, collapse whitespace)
│        ├─ Chunk (500 token target, 50 token overlap, 600 max)
│        ├─ Embed (OpenAI text-embedding-3-small, 1536 dims, batches of 100)
│        └─ Store vectors in Supabase pgvector (HNSW index)
│
└─ Customize bot (name, persona, fallback message)
Site visitor (embed widget)
│
└─ Send message
    └─► POST /api/chat
         ├─ Verify HMAC session token
         ├─ Embed question (OpenAI text-embedding-3-small)
         ├─ Cosine similarity search (top 5, threshold ≥ 0.3)
         ├─ If 0 chunks above threshold → return fallback message (no Anthropic call)
         ├─ Assemble prompt (system + persona + chunks + history + question)
         ├─ Stream response from Anthropic Claude
         └─ Return NDJSON stream (token events + citations + done)
How your content is processed.
Fetch
For URL sources, saavos fetches the page server-side with a 10-second timeout, following up to 5 redirects. It uses Mozilla Readability — the same algorithm as Firefox Reader Mode — to extract the main article body and discard navigation, ads, and boilerplate.
PDFs are uploaded to Supabase Storage and parsed with pdf-parse. Scanned (image-only) PDFs cannot be processed — they contain no extractable text. Use a text-based export or paste the content directly.
Plain text is accepted as-is. No fetching or parsing required.
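The URL fetch rules above (10-second timeout, at most 5 redirects) can be sketched roughly as follows. This is an illustration, not the actual ingest worker: `fetchPage`, `Fetcher`, and `PageResponse` are hypothetical names, the fetcher is injected so the loop can be exercised without a network, and the sketch assumes the 10 seconds cover the whole redirect chain rather than each hop.

```typescript
// Minimal response shape (hypothetical) so a fake fetcher can stand in for fetch().
interface PageResponse {
  status: number;
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}

type Fetcher = (
  url: string,
  init: { redirect: "manual"; signal: AbortSignal },
) => Promise<PageResponse>;

const FETCH_TIMEOUT_MS = 10_000; // 10-second timeout (assumed to span all hops)
const MAX_REDIRECTS = 5;         // follow at most 5 redirects

async function fetchPage(url: string, fetcher: Fetcher): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), FETCH_TIMEOUT_MS);
  try {
    let current = url;
    // One initial request plus up to MAX_REDIRECTS follow-up requests.
    for (let hop = 0; hop <= MAX_REDIRECTS; hop++) {
      const res = await fetcher(current, { redirect: "manual", signal: controller.signal });
      const location = res.headers.get("location");
      if (res.status >= 300 && res.status < 400 && location) {
        // Resolve relative Location headers against the current URL.
        current = new URL(location, current).toString();
        continue;
      }
      if (res.status < 200 || res.status >= 300) {
        throw new Error(`HTTP ${res.status} for ${current}`);
      }
      return await res.text();
    }
    throw new Error(`More than ${MAX_REDIRECTS} redirects`);
  } finally {
    clearTimeout(timer);
  }
}
```

In a server-side runtime the global `fetch` could be passed as the `fetcher` argument; the returned HTML would then go on to the Readability extraction step.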
Clean and chunk
After extraction, the text is cleaned: whitespace is normalized, HTML entities are decoded, and heading structure is preserved as metadata for each chunk.
Text is then split into chunks using a recursive character splitter that respects heading and paragraph boundaries:
Chunk size target
500 tokens
Hard maximum
600 tokens — chunks above this are split even mid-sentence
Overlap
50 tokens: the last 50 tokens of chunk N are repeated as the first 50 tokens of chunk N+1. This keeps text that straddles a chunk boundary from being cut mid-concept in both chunks.
Metadata stored per chunk
source_id, chunk_index, heading_path, token_count
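To make the overlap arithmetic concrete, here is a simplified sliding-window sketch. It is not the recursive splitter described above: it ignores heading and paragraph boundaries, treats an array of strings as already-tokenized input, and never exceeds the 500-token target, so the 600-token hard maximum does not come into play. The function name `chunkTokens`, the `tokens` field, and the array type of `heading_path` are assumptions made for illustration.

```typescript
// Per-chunk metadata fields come from the list above; `tokens` is added for the sketch.
interface Chunk {
  source_id: string;
  chunk_index: number;
  heading_path: string[]; // assumed shape, e.g. ["Pricing", "Refunds"]
  token_count: number;
  tokens: string[];
}

function chunkTokens(
  sourceId: string,
  tokens: string[],
  headingPath: string[] = [],
  target = 500,
  overlap = 50,
): Chunk[] {
  if (overlap >= target) throw new Error("overlap must be smaller than target");
  const step = target - overlap; // each chunk starts 450 tokens after the previous one
  const chunks: Chunk[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    const slice = tokens.slice(start, start + target);
    chunks.push({
      source_id: sourceId,
      chunk_index: chunks.length,
      heading_path: headingPath,
      token_count: slice.length,
      tokens: slice,
    });
    // Stop once this chunk has reached the end of the input.
    if (start + target >= tokens.length) break;
  }
  return chunks;
}
```

With these defaults, a 1,200-token source yields three chunks starting at tokens 0, 450, and 900, with lengths 500, 500, and 300.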
Embed
Each chunk is sent to OpenAI's text-embedding-3-small model in batches of 100. This produces a 1536-dimensional vector representing the semantic meaning of the chunk.
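The batching step can be pictured as a plain array split. The helper below is a hypothetical sketch (`toBatches` is not a saavos function); each resulting batch would correspond to one embedding request.

```typescript
// Split a list into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize = 100): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Under this scheme a source with 250 chunks would need three requests: two full batches of 100 and a final batch of 50.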
Embeddings are stored in Supabase Postgres via the pgvector extension in a column typed vector(1536). An HNSW index makes similarity searches fast even at tens of thousands of chunks.
If embedding fails after 3 exponential-backoff retries (1s, 2s, 4s), the source is marked error and no partial vectors are kept.
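The retry schedule above might look something like this. It is a sketch under assumptions: `embedWithRetry` and the injectable `sleep` parameter are hypothetical, and the code reads "3 retries" as one initial attempt followed by up to three more.

```typescript
const RETRY_DELAYS_MS = [1000, 2000, 4000]; // 1s, 2s, 4s

async function embedWithRetry<T>(
  attempt: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  // i = 0 is the initial attempt; i = 1..3 are the retries.
  for (let i = 0; i <= RETRY_DELAYS_MS.length; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Wait before the next retry; after the last retry, fall through and rethrow.
      if (i < RETRY_DELAYS_MS.length) await sleep(RETRY_DELAYS_MS[i]);
    }
  }
  throw lastError;
}
```

A caller that catches the final error would mark the source as error and discard any vectors already computed for it, so that no partial set is stored.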
How chat finds answers.
Embed the question
When a visitor sends a message, the question is embedded with the same text-embedding-3-small model used at ingest time. This produces a 1536-dimensional query vector.
Vector search
A cosine similarity search runs over the chunks belonging to the bot. The query filters strictly by bot_id, so one bot never retrieves another bot's chunks. The top 5 chunks by similarity score are retrieved.
Chunks with cosine similarity below 0.3 are discarded. If no candidate reaches the threshold, the Anthropic call is skipped entirely and the bot returns its configured fallback message. This costs zero generation tokens and prevents hallucinated answers to out-of-scope questions.
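The retrieval rules in this section (filter by bot, rank by cosine similarity, keep the top 5, drop anything under 0.3) can be illustrated in memory. In saavos the search runs inside Postgres with pgvector, so the code below is only a sketch of the logic; `StoredChunk`, `cosineSimilarity`, and `retrieve` are hypothetical names.

```typescript
interface StoredChunk {
  botId: string;
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as matching nothing instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function retrieve(
  query: number[],
  chunks: StoredChunk[],
  botId: string,
  k = 5,
  threshold = 0.3,
): { chunk: StoredChunk; score: number }[] {
  return chunks
    .filter((c) => c.botId === botId)                       // tenant filter
    .map((c) => ({ chunk: c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)                      // best match first
    .slice(0, k)                                            // top K
    .filter((hit) => hit.score >= threshold);               // minimum similarity
}
```

An empty result is the signal to return the fallback message without calling the generation model.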
Generate
The retrieved chunks, the bot's system prompt, persona, the last 10 turns of conversation history, and the visitor's question are assembled into a prompt. This is sent to Anthropic Claude with streaming enabled.
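As an illustration of the assembly step, the sketch below combines the pieces into a system string and a message list, keeping only the last 10 history messages. The function name, the option names, and the numbered "Context" layout are assumptions; the real prompt template is not documented here.

```typescript
type Role = "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

const HISTORY_WINDOW = 10; // last 10 turns (5 user + 5 assistant)

function assemblePrompt(opts: {
  systemPrompt: string;
  persona: string;
  chunks: string[];
  history: Message[];
  question: string;
}): { system: string; messages: Message[] } {
  // Number the chunks so the model can refer to them when citing.
  const context = opts.chunks.map((text, i) => `[${i + 1}] ${text}`).join("\n\n");
  const system = [opts.systemPrompt, opts.persona, `Context:\n${context}`].join("\n\n");
  const messages: Message[] = [
    ...opts.history.slice(-HISTORY_WINDOW),
    { role: "user", content: opts.question },
  ];
  return { system, messages };
}
```

Separating the system text from the message list in this way lets the retrieved context change on every request while the conversation history stays a simple sliding window.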
The model is instructed to answer only from the provided context and, if the context does not contain the answer, to say so rather than guess. Citation references are included in the response stream.
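A client consuming the response stream would read it one JSON object per line. The event shapes below are assumptions made for illustration (the field names `text` and `sources` are hypothetical); only the three event kinds, token, citations, and done, come from this page. For simplicity the sketch parses a complete body instead of reading incrementally.

```typescript
// Hypothetical event shapes for the NDJSON stream.
type StreamEvent =
  | { type: "token"; text: string }
  | { type: "citations"; sources: string[] }
  | { type: "done" };

// Parse newline-delimited JSON, skipping blank lines.
function parseNdjson(body: string): StreamEvent[] {
  return body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as StreamEvent);
}

// Fold the events into the final answer text and its citations.
function collectAnswer(events: StreamEvent[]): {
  answer: string;
  sources: string[];
  finished: boolean;
} {
  let answer = "";
  let sources: string[] = [];
  let finished = false;
  for (const event of events) {
    if (event.type === "token") answer += event.text;
    else if (event.type === "citations") sources = event.sources;
    else finished = true;
  }
  return { answer, sources, finished };
}
```

A widget would typically append each token to the visible message as it arrives and render the citations once the done event is seen.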
Pinned parameters.
Embedding model
OpenAI text-embedding-3-small, 1536 dimensions
Generation model
Anthropic claude-sonnet-4-6 (default). Configurable per bot.
Chunk size
500 token target, 600 token hard max, 50 token overlap
Retrieval K
Top 5 chunks by cosine similarity
Retrieval threshold
0.3 minimum cosine similarity (below this → fallback message)
History window
Last 10 turns (5 user + 5 assistant)
Embedding batch size
100 chunks per OpenAI API call
URL fetch timeout
10 seconds, max 5 redirect hops
What it runs on.
Frontend and backend
Next.js 15 App Router on Vercel. Server Components by default; client components only where interactivity is needed.
Database and auth
Supabase: Postgres for relational data, pgvector for embeddings, Auth for sessions, Storage for PDF uploads. Row Level Security on every table.
AI providers
OpenAI for embeddings only. Anthropic Claude for generation. No single-provider lock-in by design.
Billing
Dodo Payments (Merchant of Record — they handle tax and compliance).
Last updated 2026-05-13