Caladrius Health AI · caladriushealth.ai

LiveKit voice for Cally

Real-time, multilingual voice for the Cally website assistant — same brain, same conversation transcript, lowest total cost of ownership.

LiveKit Cloud (free tier) · Sarvam STT + TTS · Claude Haiku brain · Multilingual day one · Unified transcript

Drafted 2026-06-20 · pay-per-use, ≈$0 idle, no GPU

Context

Cally (the AI assistant on caladriushealth.ai) is text-only today. The repo was deliberately built to grow a voice runtime: prompt.js already has a fully written channel: 'voice' branch (spoken read-back of phone/email, acronym pronunciation, language hint), the VoiceButton exists but is disabled, and @cally/brain's own description calls it the "single source of truth imported by every Cally runtime (text Worker today; voice agent later)."

This plan adds real-time voice: a visitor clicks Voice, speaks to Cally in English/Hindi/other Indian languages, and Cally answers aloud — same persona, same knowledge base, same capture_lead flow — over WebRTC.

Decisions locked with the user: LiveKit Cloud free tier (managed SFU); Sarvam AI for STT+TTS (Indian-language-native); Claude Haiku stays the brain; multilingual from day one.

TCO model — lowest total cost of ownership

For a bursty, low-volume marketing widget, the cheapest shape is pay-per-use: $0 when idle, no fixed monthly bill, no GPU. LiveKit Cloud's free tier covers low volume; Sarvam and Claude bill only per actual conversation (cents per call). Self-hosting STT/TTS was rejected because good multilingual speech needs an always-on GPU (the prod box is CPU-only). That is a fixed cost that would exceed the pay-per-use API spend at this volume, and it would hurt latency. Revisit only if sustained volume ever makes a GPU cheaper than the API spend (instrument usage first).

| Component | Self-host cost | Pay-per-use | Verdict |
|---|---|---|---|
| LiveKit SFU (media relay) | Free, CPU-only, but needs ops + public WebRTC ingress | Free tier, per-minute at scale | Cloud free tier: zero ops at this volume |
| STT + TTS (Sarvam) | Always-on GPU (new fixed $$$) + lower Indian-lang quality | ≈$0 idle, cents/call, best quality | Pay-per-use wins |
| LLM (Claude Haiku) | GPU + loses persona/brand | Fractions of a cent/turn | Keep Claude |
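
The "revisit only if a GPU becomes cheaper" rule is a simple break-even calculation. The sketch below is illustrative only: the function names are hypothetical, and any prices passed in are placeholders to be replaced with measured usage and real vendor pricing, not figures from this plan.

```javascript
// Hypothetical cost model. No number used with these helpers is a quoted price.

// Pay-per-use: nothing when idle, linear in conversation minutes.
function payPerUseMonthly(minutes, costPerMinute) {
  return minutes * costPerMinute;
}

// Monthly minutes above which a fixed-cost GPU box would become cheaper
// than paying per minute. Returns Infinity if usage is free.
function breakEvenMinutes(fixedMonthly, costPerMinute) {
  if (costPerMinute <= 0) return Infinity;
  return fixedMonthly / costPerMinute;
}

// Which option is cheaper at a given monthly volume.
function cheaperOption(minutes, fixedMonthly, costPerMinute) {
  return payPerUseMonthly(minutes, costPerMinute) <= fixedMonthly
    ? 'pay-per-use'
    : 'self-host';
}
```

Feeding instrumented monthly minutes into `cheaperOption` gives a concrete trigger for revisiting the decision, instead of a judgment call.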

Unified transcript (key UX)

A voice conversation streams into the same Cally chat transcript as text — the visitor's spoken words (live STT) and Cally's spoken replies appear as the same message bubbles, streaming just like text chat. Voice is a mode inside the existing panel, not a separate view; both write to one shared messages state and reuse the existing bubble rendering.
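
One way to get the "same bubbles, streaming" behaviour is to treat each transcription segment as an upsert into the shared messages array. This is a minimal sketch, not the widget's actual code: the message shape (`id`, `role`, `text`, `channel`, `streaming`) and the segment shape (`id`, `role`, `text`, `final`) are assumptions made for illustration.

```javascript
// Hypothetical reducer: folds one transcription segment into the chat messages.
// Interim segments update the same bubble in place; a final segment stops streaming.
function applySegment(messages, segment) {
  const next = {
    id: segment.id,
    role: segment.role,          // 'user' (live STT) or 'assistant' (Cally's spoken reply)
    text: segment.text,
    channel: 'voice',
    streaming: !segment.final,   // keep the streaming look until the segment is final
  };

  const index = messages.findIndex((m) => m.id === segment.id);
  if (index === -1) {
    return [...messages, next];  // first time we see this segment: new bubble
  }

  const copy = messages.slice(); // never mutate the previous state
  copy[index] = { ...copy[index], ...next };
  return copy;
}
```

Because the function returns a new array and leaves typed messages untouched, it could be used directly inside a React state updater, so text and voice turns interleave in one list.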

Why this shape

Claude has no speech-to-speech model, so voice is a pipeline: mic → STT → Claude → TTS → speaker. LiveKit's Agents framework wires that loop (VAD, turn-taking, barge-in). The agent must be a persistent process, so it cannot be a Cloudflare Worker — it runs as a container on the prod podman box (teqnirvana), connecting outbound to LiveKit Cloud (no inbound ports). The Worker stays, shrunk to minting room tokens.

Browser (livekit-client)  ──WebRTC audio──►  LiveKit Cloud (SFU)  ◄──► cally-agent (prod box)
   │  ▲ transcription segments (user STT + Cally TTS text) ──────────┐      │ Sarvam STT (Saaras v3)
   │  └─ folded into the SAME Cally chat transcript (shared bubbles) ┘      │ Claude Haiku (@cally/brain)
   └──POST /voice/token──► cally-worker (Cloudflare) ──JWT──┘               │ Sarvam TTS (Bulbul v3)
                                                                            └ capture_lead → fireLead()

Verified facts (research)

Components & files

1 · LiveKit Cloud project (ops, one-time)

2 · Token endpoint — extend the existing Worker

cally-worker/src/index.js (+ cally-worker/package.json)
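
For orientation, a LiveKit access token is a JWT signed with the project's API secret (HS256), with the API key as issuer, the participant identity as subject, and the room permissions under a `video` claim. The sketch below only builds that claim set; the function name, the default TTL, and the choice of grants are assumptions, and signing is deliberately left out (the Worker could use an SDK or Web Crypto for that step).

```javascript
// Hypothetical helper: builds the claims for a room-scoped LiveKit token.
// Signing (HS256 with the API secret) is not shown here.
function buildVoiceTokenClaims({
  apiKey,
  identity,
  room,
  ttlSeconds = 600,                                  // assumed short-lived default
  nowSeconds = Math.floor(Date.now() / 1000),
}) {
  if (!apiKey || !identity || !room) {
    throw new Error('apiKey, identity and room are required');
  }
  return {
    iss: apiKey,                  // LiveKit API key identifies the project
    sub: identity,                // participant identity inside the room
    nbf: nowSeconds,
    exp: nowSeconds + ttlSeconds, // token expires after the TTL
    video: {
      room,                       // token is valid for this room only
      roomJoin: true,
      canPublish: true,           // visitor publishes microphone audio
      canSubscribe: true,         // visitor hears Cally's audio
    },
  };
}
```

Keeping the token scoped to a single room and short-lived limits what a leaked token can do, which matters because the endpoint is callable from a public website.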

3 · cally-agent — new workspace (the voice runtime)

new dir cally-agent/ — add to pnpm-workspace.yaml

4 · Frontend — voice as a mode inside the existing chat panel

widget/src/CallyWidget.jsx · cally-widget.css · new widget/src/voice/useCallyVoice.js (lazy) · widget/package.json

The hard requirement: a voice conversation streams into the same Cally transcript as text — same messages array, same bubble rendering, same streaming feel. No separate panel; a voice session writes into the existing CallyPanel conversation.

5 · Multilingual handling (day one)

6 · Deployment

Risks / watch-items

Rollout (suggested order)

  1. Worker /voice/token + LiveKit/Sarvam accounts & secrets (no UI yet; curl-test the token).
  2. cally-agent package + custom Claude node; run locally against LiveKit Cloud with a CLI test room; verify English round-trip, then Hindi.
  3. Containerize + deploy agent to prod box.
  4. Frontend: voice mode inside the existing panel (lazy chunk) + enable button; wire transcription segments into the shared messages state so spoken turns stream as chat bubbles; English first, then expose the language selector.
  5. Verify capture_lead by voice (spoken read-back → lead lands in Make.com as cally-voice).
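
Step 5 could be backed by a small automated check on the captured lead. The sketch below is hypothetical throughout: the payload field names (`source`, `phone`, `email`) are assumptions, and only the `cally-voice` tag comes from the step above.

```javascript
// Hypothetical verification helper for a lead captured over voice.
// Returns a list of problems; an empty list means the lead looks right.
function checkVoiceLead(payload) {
  const problems = [];
  if (payload.source !== 'cally-voice') {
    problems.push('source should be "cally-voice", got "' + payload.source + '"');
  }
  if (!payload.phone && !payload.email) {
    problems.push('lead has neither phone nor email');
  }
  return problems;
}
```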

Verification (end-to-end)