
AI chatbot evaluation: 12 questions to ask before you commit (2026)

Q: What are the most important criteria when evaluating an AI chatbot platform?

Four criteria carry triple weight in our evaluation rubric: (1) source ingestion breadth (URL + PDF + plain text + Notion + Q&A pairs), (2) crawl quality on a real 30-to-50-page site, (3) fallback message configurability (custom message + custom CTA + retrieval debug), and (4) inline citation UX with clickable sources. In our observation, source ingestion, fallback handling, and citations account for ~70% of satisfaction in month three: platforms that score well on them keep customers; platforms that score poorly churn within a quarter regardless of price.

Q: How long should it take to evaluate three chatbot platforms?

About 90 minutes per platform, or 4.5 hours total for three candidates. The protocol: 15 minutes to sign up and crawl your real homepage, 15 minutes to configure greeting and fallback, 30 minutes to stress-test with 10 questions (factual, synthesis, out-of-scope, multilingual, conversational follow-ups), 20 minutes to embed and mobile-test, and 10 minutes to inspect logs and pricing. The cost of picking wrong is thousands of dollars and three months of customer experience debt.

Q: What are the red flags that disqualify a chatbot platform immediately?

Three patterns consistently predict regret: (1) "Powered by GPT" with no version named — means they swap models for cost reasons without telling you; (2) no fallback message configurability — the single most-cited reason teams switch platforms after launch; (3) no per-conversation logs — you cannot debug, tune, or improve the bot, and you are signing up for a black box. Disqualify any platform missing any of these regardless of marketing or price.

Q: Why does fallback message configurability matter so much?

When the bot cannot answer, what happens? A hard-coded "I am sorry, I cannot help" leaves visitors with a worse impression than if the chatbot did not exist. A custom message with a custom CTA (link to email, scheduling, or live chat) keeps the relationship intact. The chatbot industry has converged on fallback configurability as the single criterion that most predicts whether a team is still using the platform six months in. Visitors forgive failed answers; they do not forgive abandonment.

Q: Do I need inline citations or are sources at the bottom enough?

Inline citations (small [N] markers next to each factual claim, with a collapsible source list) are the defining trust signal for an AI chatbot in 2026. A "Sources" link at the bottom is better than nothing but loses verifiability for individual claims. A bot without citation UX is roughly equivalent to a 2018 chatbot answering with rule-based decision trees. Visitors should be able to verify any claim in two clicks; otherwise they do not trust the bot enough to act on its answers.

Q: How do I test latency and mobile UX during evaluation?

For latency, embed the test widget on a staging page and measure time to first token from a residential network (not your office fiber). Acceptable: under 500 ms. Unacceptable: over 3 seconds. For mobile, open the widget in a real phone browser and check that the launcher does not collide with cookie banners or sticky footers, the chat panel takes the full screen on viewports under 480px, the close button is thumb-tappable, and the keyboard does not cover the input. 50%+ of SMB chatbot conversations happen on mobile — broken mobile cuts effective reach in half.

Saurav · Published May 6, 2026 · Updated May 14, 2026 · 13 min read

By Saurav · saavos

[!TLDR] A 12-point evaluation rubric for picking an AI chatbot platform in 2026, ordered by how often each criterion separates good platforms from bad ones in real-world testing. Score each platform you're evaluating from 0–3 on every point, multiply by the weight, and take the total. The winning platform is rarely the cheapest or the most-marketed; it's the one with the best fallback handling, citation quality, and conversation logs. Below: the rubric, what each score means, the red flags, and a 90-minute test protocol any non-technical buyer can run.

How to use this rubric

Pick three candidates. For each, run the 90-minute test at the bottom of this post and score every criterion from 0 (missing or broken) to 3 (best-in-class). Multiply each score by its weight, sum the results, and divide by the maximum score (75: the weights add up to 25, and the top score is 3) to get a percentage. Anything above 70% is shippable; below 50% is a no.

The criteria are weighted by what typically predicts 90-day satisfaction in chatbot deployments — crawl misses, bad fallback handling, and missing citations are the recurring buyer-regret patterns. This is informed industry observation, not a citation of our own cohort data.

The 12-point rubric

#	Criterion	Weight	What 0 looks like	What 3 looks like
1	Source ingestion breadth	3×	URL only	URL + PDF + text + Notion + Q&A pairs + sitemap
2	Crawl quality on a 50-page site	3×	Misses pages, duplicates, no depth control	Catches everything, deduplicates, configurable depth
3	Fallback message configurability	3×	Hard-coded "I'm sorry I can't help"	Custom message + custom routing CTA + per-source fallback
4	Citation UX	3×	None or hidden	Inline `[N]` markers + collapsible source list + clickable URLs
5	Underlying model transparency	2×	"Our AI" with no model named	Names the exact model (Claude Sonnet 4.6, GPT-4o) per tier
6	Streaming latency to first token	2×	> 3 seconds	< 500 ms
7	Conversation logs	2×	None	Searchable, exportable, with retrieval debug
8	Mobile widget UX	2×	Cramped, hard to dismiss	Full-screen on mobile, clear close, no layout collisions
9	Pricing transparency	2×	Per-resolution + overages + sales call	Flat tiers, message quotas published, upgrade in-product
10	Embed performance	1×	Synchronous script, blocks render	Async + deferred + < 5KB initial payload
11	Multilingual handling	1×	English only or broken on other languages	Major Asian + European languages tested in conversation
12	Data retention & privacy	1×	Vague terms, training opt-out unclear	Explicit DPA, no cross-tenant training, conversation export

Total max: 75 points (four criteria at 3×, five at 2×, three at 1×, each scored up to 3). Pass: ≥ 53 (about 70%). Strong: ≥ 60 (80%).
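The scoring arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an official calculator: the weights come from the table above, while the shorthand keys, the function names, and the "borderline" label for scores between 50% and 70% are assumptions made for this sketch.

```python
# Weights copied from the rubric table; the keys are shorthand labels
# chosen for this sketch, not names used by any platform.
WEIGHTS = {
    "sources": 3, "crawl": 3, "fallback": 3, "citations": 3,
    "model": 2, "latency": 2, "logs": 2, "mobile": 2, "pricing": 2,
    "embed": 1, "multilingual": 1, "privacy": 1,
}
MAX_SCORE = 3  # best-in-class on any single criterion


def max_total() -> int:
    """Highest possible weighted total: every criterion scored 3."""
    return MAX_SCORE * sum(WEIGHTS.values())


def weighted_total(scores: dict) -> int:
    """Sum of score x weight over all 12 criteria."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= MAX_SCORE:
            raise ValueError(f"{name} must be between 0 and {MAX_SCORE}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def verdict(scores: dict) -> str:
    """Apply the percentage cut-offs from 'How to use this rubric'."""
    pct = 100 * weighted_total(scores) / max_total()
    if pct > 70:
        return "shippable"
    if pct < 50:
        return "no"
    return "borderline"  # assumed label; the rubric leaves 50-70% unnamed


if __name__ == "__main__":
    all_twos = {name: 2 for name in WEIGHTS}
    print(weighted_total(all_twos), max_total(), verdict(all_twos))
```

Keeping the weights in one dictionary makes it easy to re-weight a criterion (for example, multilingual handling) without touching the rest of the calculation.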

Why these specific weights

Three observations from auditing chatbot rollouts in 2026:

Fallback handling, citations, and source ingestion (three of the four criteria weighted 3×) account for ~70% of teams' satisfaction in month three. Platforms that score well here keep customers; platforms that score poorly churn within a quarter regardless of price. These are the criteria where "looks fine in the demo" hides "broken in production."

Model transparency, latency, logs, mobile, and pricing transparency (weighted 2×) are the upgrade-or-leave dimensions. A platform can be shippable without scoring high here, but you'll want them within six months and you'll resent the platform if they're missing.

The single-weight criteria are nice-to-have differentiators. They matter at the margin between two otherwise-tied candidates.

What each score means in practice

1. Source ingestion breadth (weight 3×)

The most common reason a chatbot underperforms is that the team couldn't get their content in. PDFs that are scanned images, internal Notion pages, a structured FAQ in a Google Doc, pricing tables that live in a Webflow CMS: each of these can be a dealbreaker if the platform won't ingest it. Score 3 for a platform that supports URL + PDF + plain text + Notion + Q&A pairs + sitemap upload. Score 0 for a platform that only accepts URLs.

Red flag: "you can email us your PDFs and we'll add them" is not a 3; it's a 1.

2. Crawl quality on a 50-page site (weight 3×)

Test by crawling a real 30-to-50-page site you control. Check for:

Pages caught vs missed. Did the crawler follow internal links to depth 2+? Did it skip your /pricing because it lives behind a JS framework?
Deduplication. Did it index /blog and /blog/?utm_campaign=email as separate pages?
Sitemap respect. If you have a /sitemap.xml, did the crawler use it as a hint?

A platform that misses your pricing page because it's React-rendered is broken for 90% of modern SaaS sites. Score 3 only if the crawl was complete on first run.
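The deduplication check can be run offline if the platform lets you see or export the list of crawled URLs. The sketch below is one possible way to do it with Python's standard library; the list of tracking parameters and the function names are assumptions for illustration, and the normalization (dropping tracking parameters, trailing slashes, and fragments) is deliberately simple.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend the lists for your own campaigns.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def canonical_url(url: str) -> str:
    """Drop tracking parameters, the fragment, and any trailing slash."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )


def find_duplicates(crawled: list) -> dict:
    """Group crawled URLs that collapse to the same canonical URL."""
    groups = defaultdict(list)
    for url in crawled:
        groups[canonical_url(url)].append(url)
    return {canon: urls for canon, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    crawled = [
        "https://example.com/blog",
        "https://example.com/blog/?utm_campaign=email",
        "https://example.com/pricing",
    ]
    for canon, urls in find_duplicates(crawled).items():
        print(canon, "indexed", len(urls), "times")
```

Any group this returns is a page the crawler indexed more than once, which is the failure the deduplication check is looking for.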

3. Fallback message configurability (weight 3×)

When the bot can't answer, what happens? Three scoring tiers:

0: Hardcoded "I'm sorry, I can't help with that." Visitor leaves with a worse impression than if the chatbot didn't exist.
1: Custom message text only.
3: Custom message + custom CTA (link to email, scheduling tool, or live chat) + retrieval-debug log so you can see why retrieval failed.

This is the criterion the chatbot industry has converged on as the single best predictor of "still using the platform in six months." A bot that fails visibly and helpfully keeps customers; a bot that fails opaquely makes them feel abandoned.

4. Citation UX (weight 3×)

The defining trust signal for an AI chatbot is whether it cites its sources. Score:

0: No citations. Trust-me-bro answers.
1: A "Sources" link at the bottom of the conversation. Better than nothing.
3: Inline [N] markers next to each factual claim, with a collapsible source list showing the page title, host, and clickable URL. Visitors can verify any claim in two clicks.

A platform without citation UX in 2026 is roughly equivalent to a 2018 chatbot that answered with rule-based decision trees. Don't ship without it.

5. Underlying model transparency (weight 2×)

You should know exactly which model answers questions on each plan tier. "Powered by AI" is not an answer. Score 3 for a platform that publishes "Preview: Claude Haiku 4.5; Solo+: Claude Sonnet 4.6; Pro+: option to upgrade to Opus" or similar. Score 0 if the docs say "our proprietary model."

Why this matters beyond curiosity: model choice predicts hallucination rate, multilingual quality, and tool-use capability. You cannot tune a system you can't see.

6. Streaming latency to first token (weight 2×)

Visitors tolerate 200–500 ms of "thinking" before tokens start appearing. They do not tolerate 3 seconds. Test on your own widget while connected to a residential network (not your office fiber). If the first token consistently takes more than a second, the platform is probably queueing requests behind a slow inference endpoint, and you will lose visitors.
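If a platform exposes a streaming API, time to first token can be measured with a stopwatch around the stream. The sketch below assumes you can wrap the response in any Python iterable of text chunks; `fake_stream` is a stand-in so the example runs without a network, and the helper names and the "in between" label are hypothetical.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_token(stream: Iterable[str]) -> Optional[float]:
    """Seconds until the first non-empty chunk, or None if none arrives.

    For the number to include request time, `stream` should be lazy,
    i.e. the request is sent when iteration starts.
    """
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    return None


def rate_latency(seconds: float) -> str:
    """Map a measurement onto the rubric's two published cut-offs."""
    if seconds < 0.5:
        return "best-in-class (score 3)"
    if seconds > 3.0:
        return "unacceptable (score 0)"
    return "in between"  # the rubric does not name this band


def fake_stream(delay: float) -> Iterator[str]:
    """Stand-in for a real streaming response (assumption for the demo)."""
    time.sleep(delay)  # simulated wait before the first token
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    ttft = time_to_first_token(fake_stream(0.2))
    print(f"{ttft:.3f}s", rate_latency(ttft))
```

Run the measurement several times from a residential connection and look at the slowest results, since a single fast response says little about queueing under load.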

7. Conversation logs (weight 2×)

Without conversation logs you cannot tune the bot. The minimum acceptable feature set:

A list of every conversation, searchable by content.
Per-message retrieval debug (what chunks were retrieved? what citations were emitted?).
Export as CSV or JSON for offline analysis.

Score 0 a platform that shows only aggregate metrics. Score 3 a platform with searchable per-conversation drill-down.

8. Mobile widget UX (weight 2×)

Open the demo on a phone in a private browser tab. Test:

Does the launcher button collide with cookie banners, sticky footers, or "back to top" buttons?
Does the chat panel take the full screen on viewports under 480px, or does it cramp into a tiny corner?
Is the close button obvious and large enough to tap with a thumb?
Does the keyboard cover the input field when active?

50%+ of small-business chatbot conversations happen on mobile. A broken mobile widget cuts your effective reach in half.

9. Pricing transparency (weight 2×)

Three patterns to watch:

Flat tiers with published message quotas. ✓ Score 3.
Per-resolution pricing. ✗ Hard to forecast; usually wrong for SMBs. Score 1.
"Contact sales" anywhere in the upgrade flow before $500/month. ✗ Friction that compounds. Score 0–1.

Bonus signal for a 3: can you upgrade tiers in-product without a call? If yes, you'll grow with the platform; if no, every plan change is a multi-day project.

10. Embed performance (weight 1×)

Test by embedding the widget on a page and running Lighthouse. The script should:

Load with defer or async (no render blocking).
Be < 5 KB initial payload (the chat panel itself loads on click).
Add < 200 ms to Largest Contentful Paint (LCP).

A heavy widget script can tank a Core Web Vitals score and indirectly hurt SEO. Most modern platforms do this right; verify yours does.
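Before running Lighthouse, a quick static check can flag render-blocking script tags in the embed snippet or in your page's HTML. This is a rough sketch using Python's built-in `html.parser`; the class and function names are made up for illustration, and it only inspects static markup, so it does not measure payload size or LCP.

```python
from html.parser import HTMLParser


class ScriptAudit(HTMLParser):
    """Collect external scripts that load without async, defer, or type=module."""

    def __init__(self) -> None:
        super().__init__()
        self.blocking = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr = dict(attrs)  # boolean attributes such as `async` map to None
        src = attr.get("src")
        if not src:
            return  # inline scripts are out of scope for this check
        is_module = (attr.get("type") or "").lower() == "module"
        if "async" in attr or "defer" in attr or is_module:
            return  # module scripts are deferred by default
        self.blocking.append(src)


def blocking_scripts(html: str) -> list:
    """Return the src of every external script that can block rendering."""
    audit = ScriptAudit()
    audit.feed(html)
    audit.close()
    return audit.blocking


if __name__ == "__main__":
    page = (
        '<script src="https://widget.example/sync.js"></script>'
        '<script async src="https://widget.example/async.js"></script>'
    )
    print(blocking_scripts(page))
```

An empty result only means the tags are marked correctly; Lighthouse is still needed to confirm the payload and LCP numbers.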

11. Multilingual handling (weight 1×)

If your audience is not all English, test in their primary language. Modern frontier models (Claude Sonnet 4.6, GPT-4o) handle major European and Asian languages well; the differences show up in:

Right-to-left support. Arabic and Hebrew render correctly?
CJK rendering. Chinese, Japanese, Korean characters in citations?
Dashboard translation. Or just the bot itself?

If multilingual is critical, this becomes a 3× factor for your specific situation. For most US/UK SMBs it stays 1×.

12. Data retention and privacy (weight 1×)

Read the data processing addendum (DPA). Look for:

Explicit "we do not use your conversations to train shared models."
A maximum retention period for conversation logs.
A documented data export and deletion process.

Score 3 on platforms with a clean DPA you can read in 5 minutes. Score 0 on platforms whose privacy page is a marketing summary with no DPA link.

The 90-minute test protocol

Run this on every shortlisted platform. It surfaces 80% of the issues the rubric is designed to catch.

Minute 0–15: Sign up and crawl. Use a no-card preview or trial. Paste your real homepage URL. Wait for crawl + embedding to complete. Note: how many pages were crawled? Did it catch your pricing page?

Minute 15–30: Configure. Customize greeting, suggested starters, fallback message, brand color. Note: is the fallback configurable? How granular is the customization? Does it support a custom CTA?

Minute 30–60: Stress test. Ask 10 questions:

A factual question about your homepage. Does it cite the right page?
A factual question about your pricing. Does it pull current numbers?
A specific feature question. Are details accurate?
A trick question requiring synthesis across two pages. Does it cite both?
A question about a topic NOT on your site. Does it fall back gracefully?
A question in a non-English language (if relevant). Does it answer correctly?
A long, rambling question with three sub-questions. Does it handle conversational structure?
A follow-up referencing the previous answer. Does conversation memory work?
A question about pricing tiers it doesn't have access to. Does it admit the gap?
A typo-heavy question. Does it still parse intent?

Note: how often did the bot hallucinate, omit citations, or fail without a useful fallback?
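One lightweight way to keep that tally is a small record per question. The sketch below is only a note-taking aid under assumed field names (`hallucinated`, `cited`, `useful_fallback`); nothing here comes from a platform's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StressResult:
    question: str
    hallucinated: bool
    cited: bool
    # None when the bot answered; True/False when it fell back.
    useful_fallback: Optional[bool] = None


def summarize(results: list) -> dict:
    """Count the three failure modes named in the protocol."""
    answered = [r for r in results if r.useful_fallback is None]
    return {
        "asked": len(results),
        "hallucinated": sum(r.hallucinated for r in results),
        "missing_citation": sum(not r.cited for r in answered),
        "bad_fallback": sum(r.useful_fallback is False for r in results),
    }


if __name__ == "__main__":
    notes = [
        StressResult("Homepage fact", hallucinated=False, cited=True),
        StressResult("Pricing fact", hallucinated=True, cited=False),
        StressResult("Off-site topic", hallucinated=False, cited=False,
                     useful_fallback=True),
    ]
    print(summarize(notes))
```

Filling in one record per question keeps the comparison between platforms honest, because the same ten questions produce the same four counts for each candidate.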

Minute 60–80: Embed and mobile test. Embed the widget on a staging site. Open in mobile browser. Run through the launcher, conversation, and close flow. Test the keyboard interaction.

Minute 80–90: Logs and pricing. Open the conversation log dashboard. Find one of your test conversations. Drill into retrieval debug. Read the pricing page (not the marketing copy — the actual terms). Check the upgrade flow.

Total elapsed: 90 minutes per platform. Three candidates: 4.5 hours of work. The cost of picking wrong: thousands of dollars and three months of customer experience debt.

Red flags worth disqualifying immediately

Three patterns the chatbot industry has converged on as reliable predictors of buyer regret:

"Powered by GPT" with no version. Means they swap models without telling you, including downgrades for cost reasons. Run.
No fallback message configurability. This is the single most-cited reason teams switch platforms after launch. If a platform doesn't let you write your own fallback, it is broken.
No conversation logs. You will not be able to debug, tune, or improve the bot. You're signing up for a black box.

A worked example: scoring three platforms

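Suppose you're evaluating three platforms (say saavos, Chatbase, and Wonderchat) for a 50-page SaaS site. Run the protocol, score honestly, and compare. The table below uses synthetic scores for three anonymized candidates, Platforms A, B, and C: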

Criterion	Weight	Platform A	Platform B	Platform C
Sources	3×	2 = 6	3 = 9	3 = 9
Crawl quality	3×	3 = 9	2 = 6	3 = 9
Fallback config	3×	3 = 9	1 = 3	2 = 6
Citations	3×	3 = 9	2 = 6	2 = 6
Model transparency	2×	3 = 6	1 = 2	2 = 4
Latency	2×	2 = 4	3 = 6	2 = 4
Logs	2×	2 = 4	1 = 2	3 = 6
Mobile	2×	3 = 6	2 = 4	2 = 4
Pricing	2×	3 = 6	2 = 4	1 = 2
Embed perf	1×	3 = 3	3 = 3	2 = 2
Multilingual	1×	2 = 2	2 = 2	3 = 3
Privacy	1×	3 = 3	2 = 2	2 = 2
Total		67	49	57
% of 75		89%	65%	76%

(The maximum is 75 because the weights sum to 25 and the top score on any criterion is 3, so no platform can exceed 100%.)

In this synthetic scenario, Platform A is the strong choice. Platform C is shippable. Platform B fails on fallback handling alone, regardless of total.
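The totals for the three synthetic platforms can be checked with a short script. The weights and raw scores are copied from the table in rubric order; the maximum is taken as the sum of the weights times the top score of 3, and the variable and function names are just for this sketch.

```python
# Rubric order: sources, crawl, fallback, citations, model, latency,
# logs, mobile, pricing, embed, multilingual, privacy.
WEIGHTS = [3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1]

# Raw 0-3 scores from the worked-example table (synthetic data).
SCORES = {
    "Platform A": [2, 3, 3, 3, 3, 2, 2, 3, 3, 3, 2, 3],
    "Platform B": [3, 2, 1, 2, 1, 3, 1, 2, 2, 3, 2, 2],
    "Platform C": [3, 3, 2, 2, 2, 2, 3, 2, 1, 2, 3, 2],
}

MAX_TOTAL = 3 * sum(WEIGHTS)  # every criterion scored 3


def total(scores: list) -> int:
    """Weighted total for one platform."""
    return sum(w * s for w, s in zip(WEIGHTS, scores))


def percent(scores: list) -> int:
    """Weighted total as a whole-number percentage of the maximum."""
    return round(100 * total(scores) / MAX_TOTAL)


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: {total(scores)} / {MAX_TOTAL} = {percent(scores)}%")
```

Swapping in your own three score lists gives the same comparison for real candidates.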

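Run this on your real candidates. The platform that wins on weighted score is usually the right pick, but treat the weights as a starting point: they reflect informed industry observation rather than measured cohort data, so adjust them where your situation differs (for example, multilingual handling for a non-English audience).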

If you prefer a question-driven buyer framework rather than a scoring rubric, see 12 Questions Every SMB Founder Should Ask Before Signing an AI Chatbot Contract — it covers the same evaluation territory from the buyer's side, including the two questions saavos does not win on.

Try this rubric on us

saavos was built specifically to score 3 on the triple-weighted criteria: full source ingestion (URL + PDF + plain text), configurable fallback message with custom CTA, inline [N] citations with collapsible source list, transparent model tiers (Haiku 4.5 in preview, Sonnet 4.6 on paid), and searchable per-conversation retrieval debug. Score it yourself: preview saavos, with no credit card required and a 5-minute setup. Or see our pricing for paid-tier specifics if you're scoring saavos's higher-volume plans.


— About the author

Saurav — saavos

Builds tools for solopreneurs and small SaaS teams who don't have an afternoon to spare.
