Caladrius Health AI · caladriushealth.ai
Real-time, multilingual voice for the Cally website assistant — same brain, same conversation transcript, lowest total cost of ownership.
Cally (the AI assistant on caladriushealth.ai) is text-only today. The repo was
deliberately built to grow a voice runtime: prompt.js already has a fully
written channel: 'voice' branch (spoken read-back of phone/email, acronym pronunciation,
language hint), the VoiceButton exists but is disabled, and @cally/brain's own
description calls it the "single source of truth imported by every Cally runtime
(text Worker today; voice agent later)."
This plan adds real-time voice: a visitor clicks Voice, speaks to Cally in
English/Hindi/other Indian languages, and Cally answers aloud — same persona, same
knowledge base, same capture_lead flow — over WebRTC.
Decisions locked with the user: LiveKit Cloud free tier (managed SFU); Sarvam AI for STT+TTS (Indian-language-native); Claude Haiku stays the brain; multilingual from day one.
For a bursty, low-volume marketing widget the cheapest shape is pay-per-use, $0 when idle, no fixed monthly bill, no GPU: LiveKit Cloud free tier covers low volume; Sarvam + Claude bill only per actual conversation (cents). Self-hosting STT/TTS was rejected because good multilingual speech needs an always-on GPU (the prod box is CPU-only) — a fixed cost that would exceed the pay-per-use API spend at this volume, and hurt latency. Revisit only if sustained volume ever makes a GPU cheaper than the API spend (instrument usage first).
| Component | Self-host cost | Pay-per-use | Verdict |
|---|---|---|---|
| LiveKit SFU (media relay) | Free, CPU-only — but ops + public WebRTC ingress | Free tier, per-minute at scale | Cloud free tier — zero ops at this volume |
| STT + TTS (Sarvam) | Always-on GPU (new fixed $$$) + lower Indian-lang quality | ≈$0 idle, cents/call, best quality | Pay-per-use wins |
| LLM (Claude Haiku) | GPU + loses persona/brand | Fractions of a cent/turn | Keep Claude |
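
The "revisit only if sustained volume makes a GPU cheaper" condition reduces to a break-even comparison. A minimal sketch, assuming hypothetical helper names and placeholder figures (nothing below is a measured Cally cost or a quoted vendor price); amounts are in cents to keep the arithmetic exact:

```javascript
// Hypothetical helper: monthly minutes at which a fixed-cost GPU matches pay-per-use spend.
// Inputs are assumptions to be replaced with instrumented usage and real price sheets.
function breakEvenMinutes({ gpuMonthlyCents, apiCentsPerMinute }) {
  if (apiCentsPerMinute <= 0) throw new RangeError('apiCentsPerMinute must be positive');
  return gpuMonthlyCents / apiCentsPerMinute;
}

// Compare the two options for an observed monthly volume.
function cheaperOption({ gpuMonthlyCents, apiCentsPerMinute, minutesPerMonth }) {
  const apiSpendCents = apiCentsPerMinute * minutesPerMonth;
  return apiSpendCents < gpuMonthlyCents ? 'pay-per-use' : 'self-host';
}

// Placeholder figures only: a GPU at 30000 cents/month vs 5 cents per conversation minute.
const example = { gpuMonthlyCents: 30000, apiCentsPerMinute: 5 };
console.log(breakEvenMinutes(example));
console.log(cheaperOption({ ...example, minutesPerMonth: 500 }));
```

Feeding this with real per-minute usage is what "instrument usage first" buys: the decision becomes a single comparison rather than a guess.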
A voice conversation streams into the same Cally chat transcript as text — the
visitor's spoken words (live STT) and Cally's spoken replies appear as the same
message bubbles, streaming just like text chat. Voice is a mode inside the existing
panel, not a separate view; both write to one shared messages state and reuse the
existing bubble rendering.
Claude has no speech-to-speech model, so voice is a pipeline: mic → STT → Claude → TTS → speaker. LiveKit's Agents framework wires that loop (VAD, turn-taking, barge-in). The agent must be a persistent process, so it cannot be a Cloudflare Worker; it runs as a container on the prod podman box (teqnirvana), connecting outbound to LiveKit Cloud (no inbound ports). The Worker stays; for voice, its job shrinks to minting room tokens.
Browser (livekit-client) ──WebRTC audio──► LiveKit Cloud (SFU) ◄──► cally-agent (prod box)
Browser ──POST /voice/token──► cally-worker (Cloudflare) ──JWT──► Browser
cally-agent: Sarvam STT (Saaras v3) → Claude Haiku (@cally/brain) → Sarvam TTS (Bulbul v3); capture_lead → fireLead()
cally-agent ──transcription segments (user STT + Cally TTS text)──► Browser, folded into the SAME Cally chat transcript (shared bubbles)
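
The mic → STT → Claude → TTS loop can be pictured as one function per turn. The sketch below is library-free and uses made-up stage objects (`stt`, `llm`, `tts`, `isInterrupted` are hypothetical names for illustration); it is not the LiveKit Agents API, which handles VAD, turn-taking and barge-in itself:

```javascript
// One voice turn: audio → STT → LLM → TTS.
// The stage objects are hypothetical stand-ins, not the LiveKit Agents interfaces.
async function runTurn(audio, { stt, llm, tts, isInterrupted = () => false }) {
  const userText = await stt.transcribe(audio);
  const spoken = [];
  let reply = '';
  for await (const chunk of llm.stream(userText)) {
    if (isInterrupted()) break; // barge-in: stop once the visitor talks over the reply
    reply += chunk;
    // Synthesize chunk by chunk so audio can start before the full reply exists.
    spoken.push(await tts.synthesize(chunk));
  }
  return { userText, reply, spoken };
}

// Fake stages so the sketch runs without any network call or SDK.
const fakes = {
  stt: { transcribe: async (a) => `heard:${a}` },
  llm: {
    stream: async function* (text) {
      yield 'Hello. ';
      yield `You said ${text}`;
    },
  },
  tts: { synthesize: async (t) => `audio(${t})` },
};

runTurn('mic-frame', fakes).then((turn) => console.log(turn.reply));
```

Synthesizing per chunk rather than per reply is the design choice that matters for perceived latency: the first audio can leave while the model is still generating.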
- Sarvam has a Node plugin: @livekit/agents-plugin-sarvam (STT = Saaras, TTS = Bulbul, 22+ Indian languages). Confirmed in the livekit/agents-js plugins/ dir.
- There is no Anthropic LLM plugin for Node (agents-js/plugins/ has none; Python does) → we write a small custom LLM node. The worker's streamClaude() in cally-worker/src/index.js already implements exactly the Anthropic streaming + capture_lead tool loop we need; port it into the LLM-node interface.
- Workspace layout (packages/*, cally-worker, widget), Node ≥20; @cally/brain is type:module, pure JS, zero deps, so the agent can import it directly.
- LiveKit Cloud credentials: LIVEKIT_URL (wss://…), LIVEKIT_API_KEY, LIVEKIT_API_SECRET. Free dev tier is enough to start.

cally-worker/src/index.js (+ cally-worker/package.json)
POST /voice/token: body { path, language } → mint a LiveKit AccessToken
(via livekit-server-sdk) for a fresh room, identity = random visitor id, with
roomJoin+canPublish+canSubscribe. Encode { path, language } into the room
metadata (so the agent reads page context + language without another hop).
Respond { url: LIVEKIT_URL, token, roomName }. Reuse corsHeaders(); add a VOICE_RATELIMIT unsafe binding in wrangler.toml
(e.g. 5/min/IP) mirroring the lead limiter. Add the livekit-server-sdk dependency. New Worker secrets: LIVEKIT_API_KEY,
LIVEKIT_API_SECRET; var LIVEKIT_URL.

cally-agent: new workspace (the voice runtime)
- New dir cally-agent/; add it to pnpm-workspace.yaml.
- package.json deps: @livekit/agents, @livekit/agents-plugin-sarvam, @livekit/agents-plugin-silero (VAD), @anthropic-ai/sdk, @cally/brain (workspace:*), livekit-server-sdk.
- src/agent.js: registers a LiveKit worker (auto-dispatched to rooms). On job start: read room metadata → { path, language }; build the session (sarvam.STT / sarvam.TTS / silero.VAD / new CallyClaudeLLM(...)); voice the INITIAL_GREETING; publish transcriptions (the framework emits user-STT and agent-TTS text as lk.transcription segments), which feeds the unified transcript.
- src/claudeLLM.js: custom Anthropic LLM node implementing the agents-js LLM interface; ports streamClaude(): system = buildSystemPrompt({ path, channel:'voice', language }), messages.stream({ model:'claude-haiku-4-5-20251001', tools:[CAPTURE_LEAD_TOOL] }), on capture_lead → fireLead(env, name, contact, path, 'cally-voice').
- src/index.js: process bootstrap.
- Containerfile: Node 20 slim, run on the prod box.
- Env: LIVEKIT_URL/API_KEY/API_SECRET, SARVAM_API_KEY, ANTHROPIC_API_KEY, MAKE_WEBHOOK_URL.

widget/src/CallyWidget.jsx · cally-widget.css · new widget/src/voice/useCallyVoice.js (lazy) · widget/package.json
The hard requirement: a voice conversation streams into the same Cally transcript
as text — same messages array, same bubble rendering, same streaming feel. No
separate panel; a voice session writes into the existing CallyPanel conversation.
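
The shared-transcript requirement comes down to a small pure reducer: interim segments rewrite one bubble in place, and the final segment commits it. A minimal sketch, assuming a segment shape of { id, text, final } (modelled on what the transcription events are expected to carry) and a message shape invented here for illustration; the caller is assumed to have already mapped the participant identity to a role:

```javascript
// Fold one transcription segment into the shared messages list (pure: returns a new array).
// Assumed shapes: segment = { id, text, final }; message = { id, role, text, via, streaming }.
function foldSegment(messages, segment, role) {
  const next = {
    id: segment.id,
    role, // 'user' for the visitor, 'assistant' for the agent
    text: segment.text,
    via: 'voice',
    streaming: !segment.final, // interim segments keep the bubble in its streaming state
  };
  const index = messages.findIndex((m) => m.id === segment.id);
  if (index === -1) return [...messages, next]; // first sighting: append a new bubble
  return messages.map((m, i) => (i === index ? next : m)); // later update: replace in place
}

// Usage: one typed turn, then a spoken turn arriving as an interim and a final segment.
let messages = [{ id: 't1', role: 'user', text: 'Hi', via: 'text', streaming: false }];
messages = foldSegment(messages, { id: 's1', text: 'What is', final: false }, 'user');
messages = foldSegment(messages, { id: 's1', text: 'What is NHCX?', final: true }, 'user');
console.log(messages.length, messages[1].text, messages[1].streaming);
```

Because the reducer keys on the segment id and never mutates its input, the same function can back a React state setter for both typed and spoken turns without a second rendering path.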
- Enable VoiceButton (CallyWidget.jsx:218). On click it dynamic-imports ./voice/useCallyVoice.js so livekit-client is a separate lazy chunk; it must NOT land in the base ~83 KB bundle.
- useCallyVoice(...): POST /voice/token → connect(url, token) via livekit-client; publish mic; subscribe to Cally's audio. Subscribe to transcription (RoomEvent.TranscriptionReceived); map visitor identity → role:'user' bubble, agent identity → role:'assistant' bubble; fold segments into the SAME messages state: interim segments update the streaming bubble in place, final segments commit. Spoken turns appear and stream exactly like typed ones, interleaved in one timeline.
- Restructure messages so text (useCallyChat) and voice both write one shared list. A small via:'text'|'voice' tag is fine, but the bubbles are identical.
- Voice UI: prefers-reduced-motion, full dark-mode parity, keyboard + ARIA (mic permission, mute, hang-up, live-region state announcements). Add livekit-client (lazy chunk only).
- Language: the selected language is passed to Sarvam (target_language_code) and to buildSystemPrompt({ language }), so Claude replies in that language while keeping product terms (NHCX/ABDM/RCM/FHIR) in English (already specified in the voice prompt branch).

Deploy:
- Worker: cd cally-worker && pnpm deploy (wrangler), the same manual step as today; set the three LiveKit secrets first.
- Agent: a deploy/compose.cally-agent.yml-style service or a documented podman run; mirror the existing rootless-podman + restart-policy pattern. CI later; v1 is a documented manual deploy.
- Widget: the Worker hands LIVEKIT_URL to the browser via the token response; window.CALLY_WORKER_URL already carries the API origin.

Risks:
- Bundle size: livekit-client is heavy; it MUST be a dynamic-import chunk loaded only on Voice click. Verify the base cally-widget.js stays ~83 KB gz.
- Latency: stream at every stage (messages.stream, Sarvam TTS WebSocket); greet immediately; keep Haiku. Target < ~1.2 s first-audio after end-of-speech.
- Custom LLM node: it ports existing code (streamClaude), so risk is low.

Phases:
1. Worker: /voice/token + LiveKit/Sarvam accounts & secrets (no UI yet; curl-test the token).
2. Agent: cally-agent package + custom Claude node; run locally against LiveKit Cloud with a CLI test room; verify English round-trip, then Hindi.
3. Widget: fold transcription into the shared messages state so spoken turns stream as chat bubbles; English first, then expose the language selector.
4. Lead capture: capture_lead by voice (spoken read-back → lead lands in Make.com as cally-voice).

Verification:
- curl -X POST $WORKER/voice/token -d '{"path":"/","language":"en-IN"}' → { url, token, roomName }; decode JWT to confirm grants + metadata.
- Voice answers match the text assistant (same @cally/brain); guardrails hold (NHCX "in progress", never "certified").
- fireLead posts to Make.com tagged cally-voice.
- pnpm --filter @cally/brain test, widget build, wrangler dry-run for the worker.
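
For the "decode JWT to confirm grants + metadata" check, a few lines of Node are enough. This is a sketch for the curl test only: it reads the payload without verifying the signature, and it assumes the claim layout (identity in `sub`, grants under `video`), which should be confirmed against a real token. The helper names and the unsigned sample token are illustrative:

```javascript
// Decode (without verifying) a JWT payload so grants and metadata can be inspected.
// Debugging aid only: the signature is NOT checked here.
function decodeJwtPayload(token) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('expected header.payload.signature');
  const base64 = parts[1].replace(/-/g, '+').replace(/_/g, '/');
  const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
  const bytes = Uint8Array.from(atob(padded), (ch) => ch.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(bytes));
}

// Assumed claim layout: identity in `sub`, grants under `video`.
function summarizeGrants(payload) {
  const video = payload.video || {};
  return {
    identity: payload.sub,
    room: video.room,
    roomJoin: video.roomJoin === true,
    canPublish: video.canPublish === true,
    canSubscribe: video.canSubscribe === true,
  };
}

// Hand-built, unsigned sample so the sketch runs offline; a real token comes from /voice/token.
const toB64Url = (obj) =>
  btoa(JSON.stringify(obj)).replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
const sampleToken = [
  toB64Url({ alg: 'HS256', typ: 'JWT' }),
  toB64Url({
    sub: 'visitor-1',
    video: { room: 'demo-room', roomJoin: true, canPublish: true, canSubscribe: true },
  }),
  'unsigned',
].join('.');

console.log(summarizeGrants(decodeJwtPayload(sampleToken)));
```

In practice, replace `sampleToken` with the `token` field from the curl response and also check that the { path, language } metadata arrived where the agent expects to read it.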