Architecture — saavos Docs

System diagram

Bot owner (dashboard)
  │
  ├─ Add source (URL / PDF / text)
  │    └─► Ingest worker
  │          ├─ Fetch content (10s timeout, follow 5 redirects)
  │          ├─ Extract readable text (Mozilla Readability)
  │          ├─ Clean (strip nav/footer, collapse whitespace)
  │          ├─ Chunk (500 token target, 50 token overlap, 600 max)
  │          ├─ Embed (OpenAI text-embedding-3-small, 1536 dims, batches of 100)
  │          └─ Store vectors in Supabase pgvector (HNSW index)
  │
  └─ Customize bot (name, persona, fallback message)

Site visitor (embed widget)
  │
  └─ Send message
       └─► POST /api/chat
             ├─ Verify HMAC session token
             ├─ Embed question (OpenAI text-embedding-3-small)
             ├─ Cosine similarity search (top 5, threshold ≥ 0.3)
             ├─ If 0 chunks above threshold → return fallback message (no Anthropic call)
             ├─ Assemble prompt (system + persona + chunks + history + question)
             ├─ Stream response from Anthropic Claude
             └─ Return NDJSON stream (token events + citations + done)

Ingestion — how your content is processed

01 — Fetch

For URL sources, saavos fetches the page server-side with a 10-second timeout, following up to 5 redirects. It uses Mozilla Readability — the same algorithm as Firefox Reader Mode — to extract the main article body and discard navigation, ads, and boilerplate.

PDFs are uploaded to Supabase Storage and parsed with pdf-parse. Scanned (image-only) PDFs cannot be processed — they contain no extractable text. Use a text-based export or paste the content directly.

Plain text is accepted as-is. No fetching or parsing required.
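
The fetch limits above can be sketched in TypeScript. This is a minimal illustration, not saavos's actual code: fetchWithLimits, HopResult, and the injected fetchOnce function are hypothetical names, and the sketch assumes the 10-second budget covers the whole redirect chain rather than each hop.

```typescript
// Hypothetical shape of a single HTTP hop; a real HTTP client would supply this.
interface HopResult {
  status: number;
  location?: string; // Location header on 3xx responses
  body?: string;
}

type FetchOnce = (url: string) => Promise<HopResult>;

const FETCH_MAX_REDIRECTS = 5;
const FETCH_TIMEOUT_MS = 10_000;

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Follow redirects by hand so the hop count can be enforced.
async function followRedirects(url: string, fetchOnce: FetchOnce, maxRedirects: number): Promise<string> {
  let current = url;
  for (let hops = 0; hops <= maxRedirects; hops++) {
    const res = await fetchOnce(current);
    const isRedirect = res.status >= 300 && res.status < 400 && res.location !== undefined;
    if (!isRedirect) {
      if (res.status >= 200 && res.status < 300) return res.body ?? "";
      throw new Error(`fetch failed with status ${res.status}`);
    }
    // Resolve relative Location values against the URL that produced them.
    current = new URL(res.location as string, current).toString();
  }
  throw new Error(`more than ${maxRedirects} redirects`);
}

function fetchWithLimits(url: string, fetchOnce: FetchOnce): Promise<string> {
  return withTimeout(followRedirects(url, fetchOnce, FETCH_MAX_REDIRECTS), FETCH_TIMEOUT_MS);
}
```

Note that withTimeout only stops waiting; a production version would also cancel the in-flight request, for example with an AbortController.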

02 — Clean and chunk

After extraction, the text is cleaned: whitespace is normalized, HTML entities are decoded, and heading structure is preserved as metadata for each chunk.

Text is then split into chunks using a recursive character splitter that respects heading and paragraph boundaries:

Chunk size target: 500 tokens
Hard maximum: 600 tokens — chunks above this are split even mid-sentence
Overlap: 50 tokens. The last 50 tokens of chunk N become the first 50 tokens of chunk N+1, so a concept that straddles a chunk boundary is not cut off without context.
Metadata stored per chunk: source_id, chunk_index, heading_path, token_count
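
The overlap rule can be illustrated with a small TypeScript sketch. It is deliberately simpler than the recursive splitter described above: it slides a fixed window over an already-tokenized input and ignores heading and paragraph boundaries, so it never needs the 600-token hard maximum. The chunkTokens function, the text field, and the pre-tokenized string[] input are assumptions made for illustration; only the four metadata fields come from the list above.

```typescript
interface Chunk {
  source_id: string;
  chunk_index: number;
  heading_path: string[];
  token_count: number;
  text: string; // chunk content, added here for illustration
}

const CHUNK_TARGET_TOKENS = 500;
const CHUNK_OVERLAP_TOKENS = 50;

// Fixed-window splitter: each window starts (target - overlap) tokens after
// the previous one, so consecutive chunks share `overlap` tokens.
function chunkTokens(
  tokens: string[],
  sourceId: string,
  headingPath: string[],
  target: number = CHUNK_TARGET_TOKENS,
  overlap: number = CHUNK_OVERLAP_TOKENS,
): Chunk[] {
  if (overlap < 0 || overlap >= target) {
    throw new Error("overlap must be non-negative and smaller than target");
  }
  const chunks: Chunk[] = [];
  const step = target - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    const piece = tokens.slice(start, start + target);
    chunks.push({
      source_id: sourceId,
      chunk_index: chunks.length,
      heading_path: headingPath,
      token_count: piece.length,
      text: piece.join(" "),
    });
    if (start + target >= tokens.length) break; // this window reached the end
  }
  return chunks;
}
```

With a 1,200-token input this yields three chunks of 500, 500, and 300 tokens, and the first 50 tokens of each chunk repeat the last 50 of the one before it.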

03 — Embed

Each chunk is sent to OpenAI's text-embedding-3-small model in batches of 100. This produces a 1536-dimensional vector representing the semantic meaning of the chunk.
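
Batching itself is simple to sketch. The helper below is a hypothetical toBatches function that groups chunk texts into requests of at most 100 items; the call to the embedding API is left out.

```typescript
const EMBED_BATCH_SIZE = 100;

// Split a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number = EMBED_BATCH_SIZE): T[][] {
  if (size < 1) throw new Error("batch size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

A source with 250 chunks would therefore need three embedding requests: two full batches and one of 50.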

Embeddings are stored in Supabase Postgres via the pgvector extension in a column typed vector(1536). An HNSW index makes similarity searches fast even at tens of thousands of chunks.

If embedding still fails after 3 retries with exponential backoff (1s, 2s, 4s), the source is marked as error and no partial vectors are kept.
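
A retry loop with this schedule might look like the sketch below. It assumes one initial attempt followed by up to three retries, which is one reading of the description above. The names withBackoff and embedSource, the injected embedBatch function, and the "ready" status are hypothetical; only the delays and the "error" status come from the text.

```typescript
const EMBED_RETRY_DELAYS_MS = [1000, 2000, 4000];

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Try once, then retry after each delay in `delays`; rethrow the last error.
async function withBackoff<T>(
  attempt: () => Promise<T>,
  delays: number[] = EMBED_RETRY_DELAYS_MS,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= delays.length; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (i < delays.length) await sleep(delays[i]);
    }
  }
  throw lastError;
}

interface EmbedResult {
  status: "ready" | "error"; // "ready" is an assumed name for the success state
  vectors: number[][];
}

// All-or-nothing: if any batch exhausts its retries, discard everything.
async function embedSource(
  batches: string[][],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  sleep: Sleep = realSleep,
): Promise<EmbedResult> {
  const vectors: number[][] = [];
  try {
    for (const batch of batches) {
      const embedded = await withBackoff(() => embedBatch(batch), EMBED_RETRY_DELAYS_MS, sleep);
      vectors.push(...embedded);
    }
    return { status: "ready", vectors };
  } catch {
    return { status: "error", vectors: [] };
  }
}
```

Injecting the sleep function keeps the schedule testable without real waits.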

Retrieval — how chat finds answers

04 — Embed the question

When a visitor sends a message, the question is embedded with the same text-embedding-3-small model used at ingest time. This produces a 1536-dimensional query vector.

05 — Vector search

A cosine similarity search runs over all chunks belonging to the bot. The query is filtered strictly by bot_id, so chunks from other tenants are never candidates. The top 5 chunks by similarity score are retrieved.

Chunks with cosine similarity below 0.3 are discarded. If all 5 candidates fall below this threshold, the Anthropic call is skipped entirely and the bot returns its configured fallback message. This costs zero generation tokens and prevents hallucinated answers to out-of-scope questions.
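
The selection logic can be sketched in memory. In production the search runs inside Postgres against the HNSW index, so the brute-force scan below is only an illustration; StoredChunk, retrieve, and cosineSimilarity are hypothetical names.

```typescript
interface StoredChunk {
  bot_id: string;
  chunk_index: number;
  embedding: number[];
  text: string;
}

interface ScoredChunk {
  chunk: StoredChunk;
  score: number;
}

const RETRIEVAL_TOP_K = 5;
const RETRIEVAL_MIN_SIMILARITY = 0.3;

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Filter by tenant first, rank by similarity, keep the top K, then drop
// anything under the threshold. An empty result means "use the fallback".
function retrieve(query: number[], chunks: StoredChunk[], botId: string): ScoredChunk[] {
  return chunks
    .filter((c) => c.bot_id === botId)
    .map((c) => ({ chunk: c, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, RETRIEVAL_TOP_K)
    .filter((s) => s.score >= RETRIEVAL_MIN_SIMILARITY);
}
```

A caller would check for an empty array and return the bot's fallback message instead of building a prompt.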

06 — Generate

The retrieved chunks, the bot's system prompt and persona, the last 10 turns of conversation history (5 user and 5 assistant messages), and the visitor's question are assembled into a prompt. The prompt is sent to Anthropic Claude with streaming enabled.
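
One way to picture the assembly step is the sketch below. The exact prompt layout is not documented here, so the numbered context markers, the assemblePrompt function, and the split into a system string plus a message list are assumptions.

```typescript
interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface PromptInput {
  systemPrompt: string;
  persona: string;
  chunks: string[];
  history: Turn[];
  question: string;
}

const HISTORY_WINDOW_TURNS = 10;

function assemblePrompt(input: PromptInput): { system: string; messages: Turn[] } {
  // Number each chunk so the model can refer back to it in citations.
  const context = input.chunks.map((text, i) => `[${i + 1}] ${text}`).join("\n\n");
  const system = [input.systemPrompt, input.persona, `Context:\n${context}`].join("\n\n");
  // Keep only the most recent turns, then append the new question.
  const recent = input.history.slice(-HISTORY_WINDOW_TURNS);
  return { system, messages: [...recent, { role: "user", content: input.question }] };
}
```

Trimming history before appending the question keeps the request size bounded no matter how long the conversation runs.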

The model is instructed to answer only from the provided context. If the context does not contain the answer, the model says so — it does not guess. Citation references are included in the response stream.
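
On the client side, the NDJSON stream can be read one line at a time. The event shapes below are assumptions based on the three event kinds named in the system diagram (token, citations, done); the field names text and sources are hypothetical.

```typescript
type StreamEvent =
  | { type: "token"; text: string }
  | { type: "citations"; sources: string[] }
  | { type: "done" };

// NDJSON is one JSON document per line; blank lines are skipped.
function parseNdjson(payload: string): StreamEvent[] {
  return payload
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as StreamEvent);
}

function collectAnswer(events: StreamEvent[]): { answer: string; sources: string[]; finished: boolean } {
  let answer = "";
  let sources: string[] = [];
  let finished = false;
  for (const ev of events) {
    if (ev.type === "token") answer += ev.text;
    else if (ev.type === "citations") sources = ev.sources;
    else finished = true;
  }
  return { answer, sources, finished };
}
```

A real client would also buffer partial lines, since a network chunk can end in the middle of a JSON document.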

Pinned parameters

Embedding model: OpenAI text-embedding-3-small, 1536 dimensions
Generation model: Anthropic — a frontier Claude model (default). Configurable per bot.
Chunk size: 500 token target, 600 token hard max, 50 token overlap
Retrieval K: Top 5 chunks by cosine similarity
Retrieval threshold: 0.3 minimum cosine similarity (below this → fallback message)
History window: Last 10 turns (5 user + 5 assistant)
Embedding batch size: 100 chunks per OpenAI API call
URL fetch timeout: 10 seconds, max 5 redirect hops
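
For readers who want these values in one place in code, they could be gathered into a single typed object, as in the sketch below. The PinnedParameters interface, its field names, and the pinnedProblems check are hypothetical; the values are the ones listed above.

```typescript
interface PinnedParameters {
  embeddingModel: string;
  embeddingDimensions: number;
  chunkTargetTokens: number;
  chunkMaxTokens: number;
  chunkOverlapTokens: number;
  retrievalTopK: number;
  retrievalMinSimilarity: number;
  historyWindowTurns: number;
  embeddingBatchSize: number;
  fetchTimeoutMs: number;
  fetchMaxRedirects: number;
}

const PINNED_PARAMETERS: PinnedParameters = {
  embeddingModel: "text-embedding-3-small",
  embeddingDimensions: 1536,
  chunkTargetTokens: 500,
  chunkMaxTokens: 600,
  chunkOverlapTokens: 50,
  retrievalTopK: 5,
  retrievalMinSimilarity: 0.3,
  historyWindowTurns: 10,
  embeddingBatchSize: 100,
  fetchTimeoutMs: 10_000,
  fetchMaxRedirects: 5,
};

// Sanity checks on how the values relate to each other.
function pinnedProblems(p: PinnedParameters): string[] {
  const problems: string[] = [];
  if (p.chunkTargetTokens > p.chunkMaxTokens) problems.push("chunk target exceeds hard max");
  if (p.chunkOverlapTokens >= p.chunkTargetTokens) problems.push("overlap must be smaller than target");
  if (p.retrievalMinSimilarity < -1 || p.retrievalMinSimilarity > 1) problems.push("similarity threshold outside [-1, 1]");
  if (p.retrievalTopK < 1) problems.push("top K must be at least 1");
  return problems;
}
```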

Tech stack

Frontend and backend: Next.js 15 App Router on Vercel. Server Components by default; client components only where interactivity is needed.
Database and auth: Supabase, with Postgres for relational data, pgvector for embeddings, Auth for sessions, and Storage for PDF uploads. Row Level Security on every table.
AI providers: OpenAI for embeddings only. Anthropic Claude for generation. No single-provider lock-in by design.
Billing: Dodo Payments (Merchant of Record — they handle tax and compliance).

Last updated 2026-05-13
