By Saurav | Founder of saavos | Building in public toward $10k MRR
[!TLDR] A 12-point evaluation rubric for picking an AI chatbot platform in 2026, ordered by how often each criterion separates good platforms from bad ones in real-world testing. Score each platform you're evaluating from 0–3 on every point, multiply by the weight, and take the total. The winning platform is rarely the cheapest or the most-marketed; it's the one with the best fallback handling, citation quality, and conversation logs. Below: the rubric, what each score means, the red flags, and a 90-minute test protocol any non-technical buyer can run.
Pick three candidates. For each, run the 90-minute test at the bottom of this post and score every criterion from 0 (missing or broken) to 3 (best-in-class). Multiply each score by its weight, sum the results, and divide by the maximum score (75) for a percentage. Anything above 70% is shippable; below 50% is a no.
The criteria are weighted by what typically predicts 90-day satisfaction in chatbot deployments — crawl misses, bad fallback handling, and missing citations are the recurring buyer-regret patterns. We're pre-revenue at saavos so this is informed industry observation, not a citation of cohort data we don't have.
| # | Criterion | Weight | What 0 looks like | What 3 looks like |
|---|---|---|---|---|
| 1 | Source ingestion breadth | 3× | URL only | URL + PDF + text + Notion + Q&A pairs + sitemap |
| 2 | Crawl quality on a 50-page site | 3× | Misses pages, duplicates, no depth control | Catches everything, deduplicates, configurable depth |
| 3 | Fallback message configurability | 3× | Hard-coded "I'm sorry I can't help" | Custom message + custom routing CTA + per-source fallback |
| 4 | Citation UX | 3× | None or hidden | Inline [N] markers + collapsible source list + clickable URLs |
| 5 | Underlying model transparency | 2× | "Our AI" with no model named | Names the exact model (Claude Sonnet 4.6, GPT-4o) per tier |
| 6 | Streaming latency to first token | 2× | > 3 seconds | < 500 ms |
| 7 | Conversation logs | 2× | None | Searchable, exportable, with retrieval debug |
| 8 | Mobile widget UX | 2× | Cramped, hard to dismiss | Full-screen on mobile, clear close, no layout collisions |
| 9 | Pricing transparency | 2× | Per-resolution + overages + sales call | Flat tiers, message quotas published, upgrade in-product |
| 10 | Embed performance | 1× | Synchronous script, blocks render | Async + deferred + < 5KB initial payload |
| 11 | Multilingual handling | 1× | English only or broken on other languages | Major Asian + European languages tested in conversation |
| 12 | Data retention & privacy | 1× | Vague terms, training opt-out unclear | Explicit DPA, no cross-tenant training, conversation export |
Total max: 75 points (weights sum to 25, times a top score of 3). Pass: ≥ 53 (70%). Strong: ≥ 60 (80%).
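The scoring arithmetic can be sketched in a few lines. This is a minimal sketch, not a saavos tool: the criterion keys are illustrative names for the twelve table rows, and the maximum is computed from the weights rather than hard-coded, which gives 75 for the weights as listed in the table.

```python
# Weights copied from the rubric table. The keys are illustrative names
# chosen for this sketch, not identifiers from any platform.
WEIGHTS = {
    "sources": 3, "crawl": 3, "fallback": 3, "citations": 3,
    "model_transparency": 2, "latency": 2, "logs": 2, "mobile": 2, "pricing": 2,
    "embed_perf": 1, "multilingual": 1, "privacy": 1,
}
MAX_RAW_SCORE = 3  # every criterion is scored from 0 to 3


def weighted_total(scores: dict[str, int]) -> int:
    """Sum of score * weight over all twelve criteria."""
    total = 0
    for name, weight in WEIGHTS.items():
        if name not in scores:
            raise ValueError(f"missing score for {name!r}")
        score = scores[name]
        if not 0 <= score <= MAX_RAW_SCORE:
            raise ValueError(f"{name!r} must be scored 0..{MAX_RAW_SCORE}, got {score}")
        total += score * weight
    return total


def max_total() -> int:
    """Highest reachable total: top score on every criterion."""
    return MAX_RAW_SCORE * sum(WEIGHTS.values())


def percentage(scores: dict[str, int]) -> float:
    """Weighted total as a percentage of the maximum."""
    return 100 * weighted_total(scores) / max_total()
```

Deriving the maximum from the weights keeps the percentage correct if you re-weight a criterion for your own situation, for example raising multilingual handling to 3×.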
Three observations from auditing chatbot rollouts in 2026:
Source ingestion, crawl quality, fallback handling, and citations (weighted 3×) account for roughly 70% of teams' satisfaction in month three, in our observation. Platforms that score well here keep customers; platforms that score poorly churn within a quarter regardless of price. These are the criteria where "looks fine in the demo" hides "broken in production."
Model transparency, latency, logs, mobile, and pricing transparency (weighted 2×) are the upgrade-or-leave dimensions. A platform can be shippable without scoring high here, but you'll want them within six months and you'll resent the platform if they're missing.
The single-weight criteria are nice-to-have differentiators. They matter at the margin between two otherwise-tied candidates.
The most common reason a chatbot underperforms is that the team couldn't get their content in. PDFs that are scanned images, internal Notion pages, a structured FAQ in a Google Doc, pricing tables that live in a Webflow CMS: each of these can be a dealbreaker if the platform won't ingest it. Score a platform 3 if it supports URL + PDF + plain text + Notion + Q&A pairs + sitemap upload. Score it 0 if it only accepts URLs.
Red flag: "you can email us your PDFs and we'll add them" is not a 3; it's a 1.
Test by crawling a real 30-to-50-page site you control. Check three things:

- Did it miss /pricing because it lives behind a JS framework?
- Did it index /blog and /blog/?utm_campaign=email as separate pages?
- If you publish /sitemap.xml, did the crawler use it as a hint?

A platform that misses your pricing page because it's React-rendered is broken for 90% of modern SaaS sites. Score 3 only if the crawl was complete on first run.
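If the platform lets you export the list of crawled URLs, the duplicate check can be automated. The sketch below assumes such an export exists and treats two URLs as the same page when they differ only by a trailing slash, host casing, or utm_ tracking parameters. That normalization rule is an assumption made for illustration, not a description of how any particular crawler deduplicates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only utm_* parameters are treated as tracking noise here.
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different variants compare equal."""
    parts = urlsplit(url)
    # Drop tracking parameters, keep everything else in a stable order.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    # Treat /blog and /blog/ as the same path; keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def find_duplicates(crawled_urls: list[str]) -> dict[str, list[str]]:
    """Group crawled URLs by canonical form and return groups with 2+ entries."""
    groups: dict[str, list[str]] = {}
    for url in crawled_urls:
        groups.setdefault(canonical_url(url), []).append(url)
    return {canon: urls for canon, urls in groups.items() if len(urls) > 1}
```

Any non-empty result from `find_duplicates` means the crawler indexed the same page more than once, which points to a lower crawl-quality score.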
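When the bot can't answer, what happens? Scores run from 0 (a hard-coded "I'm sorry I can't help") to 3 (a custom message, a custom routing CTA, and per-source fallback).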
This is the criterion the chatbot industry has converged on as the single best predictor of "still using the platform in six months." A bot that fails visibly and helpfully keeps customers; a bot that fails opaquely makes them feel abandoned.
The defining trust signal for an AI chatbot is whether it cites its sources. Score 3 for inline [N] markers next to each factual claim, with a collapsible source list showing the page title, host, and clickable URL. Visitors can verify any claim in two clicks. A single "Sources" link at the bottom is better than nothing but loses verifiability for individual claims. Score 0 if citations are missing or hidden.

A platform without citation UX in 2026 is roughly equivalent to a 2018 chatbot that answered with rule-based decision trees. Don't ship without it.
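During testing you can sanity-check citation markers by hand or with a small script. The sketch below assumes you can copy an answer as plain text along with its source list, and that markers are 1-indexed into that list. Both are assumptions about the export format, which varies by platform.

```python
import re

# Matches inline markers such as [1] or [12].
CITATION_MARKER = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, sources: list[str]) -> dict[str, list[int]]:
    """Compare [N] markers in an answer against the listed sources.

    Assumes markers are 1-indexed positions in `sources`.
    """
    cited = sorted({int(n) for n in CITATION_MARKER.findall(answer)})
    valid = set(range(1, len(sources) + 1))
    return {
        "cited": cited,
        # Markers that point at a source that is not in the list.
        "dangling": [n for n in cited if n not in valid],
        # Listed sources that no claim in the answer refers to.
        "unused": sorted(valid - set(cited)),
    }
```

An answer with no markers at all returns an empty `cited` list, which is the case the rubric scores as 0.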
You should know exactly which model answers questions on each plan tier. "Powered by AI" is not an answer. Score a platform 3 if it publishes "Free tier: Claude Haiku 4.5; Starter+: Claude Sonnet 4.6; Pro+: option to upgrade to Opus" or similar. Score it 0 if the docs say "our proprietary model."
Why this matters beyond curiosity: model choice predicts hallucination rate, multilingual quality, and tool-use capability. You cannot tune a system you can't see.
Visitors tolerate 200–500 ms of "thinking" before tokens start appearing. They do not tolerate 3 seconds. Test on your own widget while connected to a residential network (not your office fiber). If the first token takes more than a second, the platform is probably queueing requests behind a slow inference endpoint, and you will lose visitors.
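If a platform exposes a streaming API, time to first token can be measured with a helper like the one below. It is a minimal sketch that takes any iterable of text chunks, so how you obtain the stream (an SDK, a server-sent-events client, a raw HTTP response) is left open and depends on the platform. The `clock` parameter exists only to make the function testable.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from the call until the first non-empty chunk arrives.

    Returns None if the stream ends without producing any visible text.
    Pass a lazy stream (one that sends the request when iteration starts)
    so the measurement includes the request round trip.
    """
    start = clock()
    for chunk in chunks:
        # Skip keep-alive or whitespace-only chunks; they are not visible tokens.
        if chunk.strip():
            return clock() - start
    return None
```

Run it several times and look at the slowest results, not just the average, since visitors experience the worst case as often as the typical one.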
Without conversation logs you cannot tune the bot. The minimum acceptable feature set is per-conversation logs you can search, ideally exportable and with retrieval debug.
Score a platform 0 if it shows only aggregate metrics. Score it 3 if it offers searchable per-conversation drill-down.
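Open the demo on a phone in a private browser tab. Test that:

- The launcher does not collide with cookie banners or sticky footers.
- The chat panel takes the full screen on viewports under 480px.
- The close button is thumb-tappable.
- The keyboard does not cover the input.
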
50%+ of small-business chatbot conversations happen on mobile. A broken mobile widget cuts your effective reach in half.
Three patterns to watch: per-resolution pricing, overage charges, and plan changes that require a sales call.
Bonus signal for a 3: can you upgrade tiers in-product without a call? If yes, you'll grow with the platform; if no, every plan change is a multi-day project.
Test by embedding the widget on a page and running Lighthouse. The script should:

- Load with defer or async (no render blocking).
- Keep its initial payload under 5 KB.

A heavy widget script can tank a Core Web Vitals score and indirectly hurt SEO. Most modern platforms do this right; verify yours does.
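Before running Lighthouse, a quick static check of the embed snippet can flag render-blocking script tags. The sketch below uses Python's standard `html.parser` and only inspects the markup you give it, so it will not see scripts that the widget injects at runtime. Treat it as a first pass, with Lighthouse as the real test.

```python
from html.parser import HTMLParser


class ScriptAudit(HTMLParser):
    """Collects external scripts that would block rendering."""

    def __init__(self) -> None:
        super().__init__()
        self.blocking: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attributes = dict(attrs)  # boolean attributes map to None
        src = attributes.get("src")
        if not src:
            return  # inline scripts are out of scope for this check
        non_blocking = (
            "async" in attributes
            or "defer" in attributes
            or attributes.get("type") == "module"  # module scripts defer by default
        )
        if not non_blocking:
            self.blocking.append(src)


def blocking_scripts(html: str) -> list[str]:
    """Return the src of every external script lacking async, defer, or type=module."""
    audit = ScriptAudit()
    audit.feed(html)
    audit.close()
    return audit.blocking
```

Paste the vendor's embed snippet into `blocking_scripts`; an empty list is the result you want.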
If your audience is not all English, test in their primary language. Modern frontier models (Claude Sonnet 4.6, GPT-4o) handle major European and Asian languages well; the differences show up in:
If multilingual is critical, this becomes a 3× factor for your specific situation. For most US/UK SMBs it stays 1×.
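Read the data processing addendum (DPA). Look for an explicit commitment to no cross-tenant training, a clear training opt-out, and conversation export.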
Score 3 on platforms with a clean DPA you can read in 5 minutes. Score 0 on platforms whose privacy page is a marketing summary with no DPA link.
Run this on every shortlisted platform. It surfaces 80% of the issues the rubric is designed to catch.
Minute 0–15: Sign up and crawl. Use a free tier or trial. Paste your real homepage URL. Wait for crawl + embedding to complete. Note: how many pages were crawled? Did it catch your pricing page?
Minute 15–30: Configure. Customize greeting, suggested starters, fallback message, brand color. Note: is the fallback configurable? How granular is the customization? Does it support a custom CTA?
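Minute 30–60: Stress test. Ask 10 questions covering factual lookups, synthesis, out-of-scope requests, multilingual queries, and conversational follow-ups.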
Note: how often did the bot hallucinate, omit citations, or fail without a useful fallback?
Minute 60–80: Embed and mobile test. Embed the widget on a staging site. Open in mobile browser. Run through the launcher, conversation, and close flow. Test the keyboard interaction.
Minute 80–90: Logs and pricing. Open the conversation log dashboard. Find one of your test conversations. Drill into retrieval debug. Read the pricing page (not the marketing copy — the actual terms). Check the upgrade flow.
Total elapsed: 90 minutes per platform. Three candidates: 4.5 hours of work. The cost of picking wrong: thousands of dollars and three months of customer experience debt.
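Three patterns the chatbot industry has converged on as reliable predictors of buyer regret:

- "Powered by GPT" with no version named. It means they swap models for cost reasons without telling you.
- No fallback message configurability. This is the single most-cited reason teams switch platforms after launch.
- No per-conversation logs. You cannot debug, tune, or improve the bot, and you are signing up for a black box.
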
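Suppose you're evaluating saavos, Chatbase, and Wonderchat for a 50-page SaaS site. Run the protocol, score honestly, and compare. The scores below are synthetic and use neutral labels (Platform A, B, C), not real results for those products: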
| Criterion | Weight | Platform A | Platform B | Platform C |
|---|---|---|---|---|
| Sources | 3× | 2 = 6 | 3 = 9 | 3 = 9 |
| Crawl quality | 3× | 3 = 9 | 2 = 6 | 3 = 9 |
| Fallback config | 3× | 3 = 9 | 1 = 3 | 2 = 6 |
| Citations | 3× | 3 = 9 | 2 = 6 | 2 = 6 |
| Model transparency | 2× | 3 = 6 | 1 = 2 | 2 = 4 |
| Latency | 2× | 2 = 4 | 3 = 6 | 2 = 4 |
| Logs | 2× | 2 = 4 | 1 = 2 | 3 = 6 |
| Mobile | 2× | 3 = 6 | 2 = 4 | 2 = 4 |
| Pricing | 2× | 3 = 6 | 2 = 4 | 1 = 2 |
| Embed perf | 1× | 3 = 3 | 3 = 3 | 2 = 2 |
| Multilingual | 1× | 2 = 2 | 2 = 2 | 3 = 3 |
| Privacy | 1× | 3 = 3 | 2 = 2 | 2 = 2 |
| Total | | 67 | 49 | 57 |
| % of 75 | | 89% | 65% | 76% |

(The maximum is 75: four criteria at 3×, five at 2×, and three at 1×, each scored up to 3.)
In this synthetic scenario, Platform A is the strong choice. Platform C is shippable. Platform B fails on fallback handling alone, regardless of total.
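The "fails on fallback handling alone" rule is a hard gate applied before the percentage thresholds. A minimal sketch follows. The gate value (a fallback score of 1 or lower disqualifies) and the "borderline" label for the range the post leaves unnamed are both assumptions made for illustration.

```python
def verdict(percent: float, fallback_score: int, gate: int = 1) -> str:
    """Classify a platform from its weighted percentage and fallback score.

    Assumptions for this sketch: a fallback score at or below `gate`
    disqualifies outright, and the unnamed 50%..70% range is "borderline".
    """
    if fallback_score <= gate:
        return "disqualified"  # fails on fallback handling regardless of total
    if percent > 70:
        return "shippable"  # "anything above 70% is shippable"
    if percent < 50:
        return "no"  # "below 50% is a no"
    return "borderline"
```

With Platform B's fallback score of 1 from the table, `verdict` returns "disqualified" whatever percentage you pass in.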
Run this on your real candidates. The platform that wins on weighted score is almost always the right pick, because the weights favor the criteria that most often cause post-launch regret. As noted earlier, that weighting comes from industry observation, not from cohort data.
If you prefer a question-driven buyer framework rather than a scoring rubric, see 12 Questions Every SMB Founder Should Ask Before Signing an AI Chatbot Contract — it covers the same evaluation territory from the buyer's side, including the two questions saavos does not win on.
saavos was built specifically to score highly on the triple-weighted criteria: full source ingestion (URL + PDF + plain text), configurable fallback message with custom CTA, inline [N] citations with collapsible source list, transparent model tiers (Haiku 4.5 on free, Sonnet 4.6 on paid), and searchable per-conversation retrieval debug. Score it yourself: start free on saavos (no credit card required, 5-minute setup). Or see our pricing for paid-tier specifics if you're scoring saavos's higher-volume plans.
Four criteria carry triple weight in our evaluation rubric: (1) source ingestion breadth (URL + PDF + plain text + Notion + Q&A pairs), (2) crawl quality, (3) fallback message configurability (custom message + custom CTA + per-source fallback), and (4) inline citation UX with clickable sources. In our observation they predict ~70% of long-term satisfaction. Platforms that score well on them keep customers in month three; platforms that score poorly churn within a quarter regardless of price.
About 90 minutes per platform, or 4.5 hours total for three candidates. The protocol: 15 minutes to sign up and crawl your real homepage, 15 minutes to configure greeting and fallback, 30 minutes to stress-test with 10 questions (factual, synthesis, out-of-scope, multilingual, conversational follow-ups), 20 minutes to embed and mobile-test, and 10 minutes to inspect logs and pricing. The cost of picking wrong is thousands of dollars and three months of customer experience debt.
Three patterns consistently predict regret: (1) "Powered by GPT" with no version named — means they swap models for cost reasons without telling you; (2) no fallback message configurability — the single most-cited reason teams switch platforms after launch; (3) no per-conversation logs — you cannot debug, tune, or improve the bot, and you are signing up for a black box. Disqualify any platform missing any of these regardless of marketing or price.
When the bot cannot answer, what happens? A hard-coded "I am sorry, I cannot help" leaves visitors with a worse impression than if the chatbot did not exist. A custom message with a custom CTA (link to email, scheduling, or live chat) keeps the relationship intact. The chatbot industry has converged on fallback configurability as the single criterion that most predicts whether a team is still using the platform six months in. Visitors forgive failed answers; they do not forgive abandonment.
Inline citations (small [N] markers next to each factual claim, with a collapsible source list) are the defining trust signal for an AI chatbot in 2026. A "Sources" link at the bottom is better than nothing but loses verifiability for individual claims. A bot without citation UX is roughly equivalent to a 2018 chatbot answering with rule-based decision trees. Visitors should be able to verify any claim in two clicks; otherwise they do not trust the bot enough to act on its answers.
For latency, embed the test widget on a staging page and measure time to first token from a residential network (not your office fiber). Acceptable: under 500 ms. Unacceptable: over 3 seconds. For mobile, open the widget in a real phone browser and check that the launcher does not collide with cookie banners or sticky footers, the chat panel takes the full screen on viewports under 480px, the close button is thumb-tappable, and the keyboard does not cover the input. 50%+ of SMB chatbot conversations happen on mobile — broken mobile cuts effective reach in half.