Plan — LiveKit voice for Cally

Context

Cally (the AI assistant on caladriushealth.ai) is text-only today. The repo was deliberately built to grow a voice runtime: prompt.js already has a fully written channel: 'voice' branch (spoken read-back of phone/email, acronym pronunciation, language hint), the VoiceButton exists but is disabled, and @cally/brain's own description calls it the "single source of truth imported by every Cally runtime (text Worker today; voice agent later)."

This plan adds real-time voice: a visitor clicks Voice, speaks to Cally in English/Hindi/other Indian languages, and Cally answers aloud — same persona, same knowledge base, same capture_lead flow — over WebRTC.

Decisions locked with the user: LiveKit Cloud free tier (managed SFU); Sarvam AI for STT+TTS (Indian-language-native); Claude Haiku stays the brain; multilingual from day one.

TCO model — lowest total cost of ownership

For a bursty, low-volume marketing widget the cheapest shape is pay-per-use, $0 when idle, no fixed monthly bill, no GPU: LiveKit Cloud free tier covers low volume; Sarvam + Claude bill only per actual conversation (cents). Self-hosting STT/TTS was rejected because good multilingual speech needs an always-on GPU (the prod box is CPU-only) — a fixed cost that would exceed the pay-per-use API spend at this volume, and hurt latency. Revisit only if sustained volume ever makes a GPU cheaper than the API spend (instrument usage first).

Component	Self-host cost	Pay-per-use	Verdict
LiveKit SFU (media relay)	Free, CPU-only — but ops + public WebRTC ingress	Free tier, per-minute at scale	Cloud free tier — zero ops at this volume
STT + TTS (Sarvam)	Always-on GPU (new fixed $$$) + lower Indian-lang quality	≈$0 idle, cents/call, best quality	Pay-per-use wins
LLM (Claude Haiku)	GPU + loses persona/brand	Fractions of a cent/turn	Keep Claude
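
The "revisit only if sustained volume makes a GPU cheaper" rule can be made concrete with a small break-even calculation. This is a minimal sketch: the function names are illustrative and the dollar figures are hypothetical placeholders, not measured costs. Real inputs should come from instrumented usage.

```javascript
// Break-even sketch: at what monthly call volume does a fixed-cost GPU box
// stop being more expensive than pay-per-use APIs?
// All figures passed in below are HYPOTHETICAL; amounts are integer cents.

/**
 * Smallest monthly call count at which self-hosting is no more expensive
 * than pay-per-use.
 * @param {number} fixedMonthlyCents fixed monthly cost of self-hosting
 * @param {number} perCallCents      API spend for one conversation
 */
function breakEvenCallsPerMonth(fixedMonthlyCents, perCallCents) {
  if (perCallCents <= 0) throw new RangeError('perCallCents must be positive');
  // Pay-per-use costs calls * perCallCents; self-hosting costs the fixed
  // amount regardless of volume. The lines cross at fixed / perCall.
  return Math.ceil(fixedMonthlyCents / perCallCents);
}

/** Which option is cheaper at a given monthly volume. */
function cheaperOption(callsPerMonth, fixedMonthlyCents, perCallCents) {
  return callsPerMonth * perCallCents < fixedMonthlyCents
    ? 'pay-per-use'
    : 'self-host';
}

// Hypothetical: a $400/month GPU box versus 25 cents per conversation.
console.log(breakEvenCallsPerMonth(40000, 25)); // 1600
console.log(cheaperOption(200, 40000, 25));     // pay-per-use
```

Below the break-even volume the pay-per-use shape wins on cost alone, before counting the ops burden of running a GPU host.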

Unified transcript (key UX)

A voice conversation streams into the same Cally chat transcript as text — the visitor's spoken words (live STT) and Cally's spoken replies appear as the same message bubbles, streaming just like text chat. Voice is a mode inside the existing panel, not a separate view; both write to one shared messages state and reuse the existing bubble rendering.

Why this shape

Claude has no speech-to-speech model, so voice is a pipeline: mic → STT → Claude → TTS → speaker. LiveKit's Agents framework wires that loop (VAD, turn-taking, barge-in). The agent must be a persistent process, so it cannot be a Cloudflare Worker — it runs as a container on the prod podman box (teqnirvana), connecting outbound to LiveKit Cloud (no inbound ports). The Worker stays, shrunk to minting room tokens.

Browser (livekit-client)  ──WebRTC audio──►  LiveKit Cloud (SFU)  ◄──► cally-agent (prod box)
   │  ▲ transcription segments (user STT + Cally TTS text) ──────────┐      │ Sarvam STT (Saaras v3)
   │  └─ folded into the SAME Cally chat transcript (shared bubbles) ┘      │ Claude Haiku (@cally/brain)
   └──POST /voice/token──► cally-worker (Cloudflare) ──JWT──┘               │ Sarvam TTS (Bulbul v3)
                                                                            └ capture_lead → fireLead()

Verified facts (research)

Sarvam LiveKit plugin exists for Node: @livekit/agents-plugin-sarvam (STT = Saaras, TTS = Bulbul, 22+ Indian languages). Confirmed in livekit/agents-js plugins/ dir.
Sarvam streaming APIs: STT WebSocket (Saaras v3, ~70 ms), TTS WebSocket (Bulbul v3, sub-250 ms first byte, Node SDK) — fits the latency budget.
No Node Anthropic plugin (agents-js/plugins/ has none; Python does). → we write a small custom LLM node. The worker's streamClaude() in cally-worker/src/index.js already implements exactly the Anthropic streaming + capture_lead tool loop we need — port it into the LLM-node interface.
Repo is a pnpm workspace (packages/*, cally-worker, widget), Node ≥20, @cally/brain is type:module, pure JS, zero deps — importable by the agent.

Components & files

1 · LiveKit Cloud project (ops, one-time)

Create a LiveKit Cloud project → obtain LIVEKIT_URL (wss://…), LIVEKIT_API_KEY, LIVEKIT_API_SECRET. Free dev tier is enough to start.
Obtain a Sarvam API key.

2 · Token endpoint — extend the existing Worker

cally-worker/src/index.js (+ cally-worker/package.json)

Add POST /voice/token: body { path, language } → mint a LiveKit AccessToken (via livekit-server-sdk) for a fresh room, identity = random visitor id, with roomJoin+canPublish+canSubscribe. Encode { path, language } into the room metadata (so the agent reads page context + language without another hop). Respond { url: LIVEKIT_URL, token, roomName }.
Reuse corsHeaders(); add a VOICE_RATELIMIT unsafe binding in wrangler.toml (e.g. 5/min/IP) mirroring the lead limiter.
New deps: livekit-server-sdk. New Worker secrets: LIVEKIT_API_KEY, LIVEKIT_API_SECRET, var LIVEKIT_URL.
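
The pure part of the /voice/token handler (validating the body and deriving the room name, identity, grants and metadata) might look like the sketch below. The function name, the cally-voice- room prefix, the 256-character path cap and the en-IN fallback are assumptions for illustration. The returned values would then be handed to livekit-server-sdk's AccessToken for signing, and how the metadata string is attached to the room is left to that minting step.

```javascript
// Sketch of the input handling for POST /voice/token.
// Names, prefixes and limits here are ILLUSTRATIVE, not the worker's code.

const LANGUAGE_TAG = /^[a-z]{2,3}-[A-Z]{2}$/; // shape check only, e.g. "en-IN"

/**
 * @param {unknown} body       parsed JSON request body
 * @param {() => string} randomId  id generator (injectable for tests)
 */
function buildVoiceTokenParams(
  body,
  randomId = () => globalThis.crypto.randomUUID(),
) {
  // Accept only site-relative paths; fall back to the home page.
  const path =
    typeof body?.path === 'string' && body.path.startsWith('/')
      ? body.path.slice(0, 256)
      : '/';

  // Accept only well-formed language tags; fall back to English.
  const language =
    typeof body?.language === 'string' && LANGUAGE_TAG.test(body.language)
      ? body.language
      : 'en-IN';

  return {
    roomName: `cally-voice-${randomId()}`, // fresh room per voice session
    identity: `visitor-${randomId()}`,     // random visitor id
    grants: { roomJoin: true, canPublish: true, canSubscribe: true },
    metadata: JSON.stringify({ path, language }), // read by the agent on job start
  };
}
```

Keeping this step free of I/O means it can be unit-tested without LiveKit credentials; only the signing call needs the API key and secret.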

3 · `cally-agent` — new workspace (the voice runtime)

new dir cally-agent/ — add to pnpm-workspace.yaml

package.json deps: @livekit/agents, @livekit/agents-plugin-sarvam, @livekit/agents-plugin-silero (VAD), @anthropic-ai/sdk, @cally/brain (workspace:*), livekit-server-sdk.
src/agent.js — registers a LiveKit worker (auto-dispatched to rooms). On job start: read room metadata → { path, language }; build the session (sarvam.STT / sarvam.TTS / silero.VAD / new CallyClaudeLLM(...)); voice the INITIAL_GREETING; publish transcriptions (the framework emits user-STT and agent-TTS text as lk.transcription segments) — this feeds the unified transcript.
src/claudeLLM.js — custom Anthropic LLM node implementing the agents-js LLM interface; ports streamClaude(): system = buildSystemPrompt({ path, channel:'voice', language }), messages.stream({ model:'claude-haiku-4-5-20251001', tools:[CAPTURE_LEAD_TOOL] }), on capture_lead → fireLead(env, name, contact, path, 'cally-voice').
src/index.js — process bootstrap. Containerfile — Node 20 slim, run on the prod box.
Container env: LIVEKIT_URL/API_KEY/API_SECRET, SARVAM_API_KEY, ANTHROPIC_API_KEY, MAKE_WEBHOOK_URL.
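
The core of the custom LLM node is separating streamed text (to be spoken immediately) from tool-call arguments (which arrive as JSON fragments and are only usable once the block closes). The sketch below folds Anthropic Messages streaming events one at a time. The event shapes follow the Anthropic streaming format; the function name and return shape are hypothetical and are not the worker's actual streamClaude(). Wiring this into the agents-js LLM interface is not shown.

```javascript
// Sketch: fold Anthropic streaming events into "text to speak" and
// "completed tool calls". Illustrative only; not the real streamClaude().

function createClaudeStreamFolder() {
  // content block index -> { id, name, json } for tool_use blocks in flight
  const toolBlocks = new Map();

  return function fold(event) {
    switch (event.type) {
      case 'content_block_start':
        if (event.content_block.type === 'tool_use') {
          const { id, name } = event.content_block;
          toolBlocks.set(event.index, { id, name, json: '' });
        }
        return null;

      case 'content_block_delta':
        if (event.delta.type === 'text_delta') {
          // Text can be forwarded to TTS as soon as it arrives.
          return { kind: 'text', text: event.delta.text };
        }
        if (event.delta.type === 'input_json_delta') {
          // Tool arguments arrive as partial JSON; buffer until the block ends.
          const block = toolBlocks.get(event.index);
          if (block) block.json += event.delta.partial_json;
        }
        return null;

      case 'content_block_stop': {
        const block = toolBlocks.get(event.index);
        if (!block) return null; // a text block ended; nothing to emit
        toolBlocks.delete(event.index);
        return {
          kind: 'tool',
          id: block.id,
          name: block.name,
          input: JSON.parse(block.json || '{}'),
        };
      }

      default:
        return null; // message_start, message_delta, ping, ...
    }
  };
}
```

In the agent, each event from the SDK's stream would pass through fold(): 'text' results feed the TTS stream, and a 'tool' result named capture_lead would trigger fireLead() before the loop continues with a tool_result message.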

4 · Frontend — voice as a mode inside the existing chat panel

widget/src/CallyWidget.jsx · cally-widget.css · new widget/src/voice/useCallyVoice.js (lazy) · widget/package.json

The hard requirement: a voice conversation streams into the same Cally transcript as text — same messages array, same bubble rendering, same streaming feel. No separate panel; a voice session writes into the existing CallyPanel conversation.

Enable the VoiceButton (CallyWidget.jsx:218). On click it dynamic-imports ./voice/useCallyVoice.js so livekit-client is a separate lazy chunk — it must NOT land in the base ~83 KB bundle.
useCallyVoice(...): POST /voice/token → connect(url, token) via livekit-client; publish mic; subscribe to Cally's audio. Subscribe to transcription (RoomEvent.TranscriptionReceived); map visitor identity → role:'user' bubble, agent identity → role:'assistant' bubble; fold segments into the SAME messages state — interim segments update the streaming bubble in place, final segments commit. Spoken turns appear and stream exactly like typed ones, interleaved in one timeline.
Lift messages so text (useCallyChat) and voice both write one shared list. A small via:'text'|'voice' tag is fine, but the bubbles are identical.
Voice chrome inside the panel: language selector (English, हिन्दी, + a few Sarvam languages; default English), live state (connecting / listening / Cally speaking / muted), mute, end-call. Text input stays usable — switch between speaking and typing in one thread.
prefers-reduced-motion, full dark-mode parity, keyboard + ARIA (mic permission, mute, hang-up, live-region state announcements). Add livekit-client (lazy chunk only).
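
The "interim segments update in place, final segments commit" rule reduces to a small pure function over the shared messages list. This is a sketch under assumptions: the message shape ({ id, role, via, text, streaming }) and both function names are hypothetical, since the existing messages state is not shown here. The segment fields used (id, text, final) are the ones livekit-client delivers with RoomEvent.TranscriptionReceived.

```javascript
// Sketch: fold LiveKit transcription segments into the shared chat list.
// The message shape is an ASSUMPTION about the widget's state.

/** The local participant is the visitor; anyone else is Cally. */
function roleForParticipant(participantIdentity, localIdentity) {
  return participantIdentity === localIdentity ? 'user' : 'assistant';
}

/**
 * Returns a new messages array with the segment applied.
 * - unknown segment id  -> append a new bubble
 * - known segment id    -> replace that bubble's text in place
 * - segment.final       -> bubble stops streaming (committed)
 */
function foldTranscription(messages, segment, role) {
  const next = {
    id: segment.id,
    role,
    via: 'voice',
    text: segment.text,
    streaming: !segment.final,
  };
  const index = messages.findIndex((m) => m.id === segment.id);
  if (index === -1) return [...messages, next];
  return messages.map((m, i) => (i === index ? next : m));
}
```

Because the function never mutates its input, it can sit inside a React state updater (setMessages(prev => foldTranscription(prev, seg, role))) and typed messages already in the list are left untouched, which keeps text and voice turns interleaved in one timeline.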

5 · Multilingual handling (day one)

Deterministic language selector (robust for v1) → flows via room metadata → agent sets Sarvam STT input language, TTS target_language_code, and buildSystemPrompt({ language }) so Claude replies in that language while keeping product terms (NHCX/ABDM/RCM/FHIR) in English (already specified in the voice prompt branch).
Fast-follow (not v1): Saaras auto-detect + mid-call language switch.
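
On the agent side, the selector value read from room metadata fans out to three consumers: STT input language, TTS target language, and the system prompt. A minimal sketch follows. The supported-language table, the function name and the returned field names are hypothetical, and whether buildSystemPrompt expects a code or a display name is an assumption to check against prompt.js.

```javascript
// Sketch: resolve one language choice into the three places it is needed.
// The language table and field names are ILLUSTRATIVE.

const VOICE_LANGUAGES = {
  'en-IN': 'English',
  'hi-IN': 'Hindi',
  'ta-IN': 'Tamil',
  'bn-IN': 'Bengali',
};
const DEFAULT_LANGUAGE = 'en-IN';

/** @param {string | undefined} roomMetadata JSON string written by the token endpoint */
function resolveVoiceLanguage(roomMetadata) {
  let requested;
  try {
    requested = JSON.parse(roomMetadata || '{}').language;
  } catch {
    requested = undefined; // malformed or null metadata: use the default
  }
  const code = Object.hasOwn(VOICE_LANGUAGES, requested)
    ? requested
    : DEFAULT_LANGUAGE;

  return {
    sttLanguage: code,                     // Sarvam STT input language
    ttsLanguage: code,                     // Sarvam TTS target_language_code
    promptLanguage: VOICE_LANGUAGES[code], // passed to buildSystemPrompt({ language })
  };
}
```

Falling back to English on anything unrecognised keeps the v1 selector deterministic: a bad or missing value degrades to a working call rather than a failed session.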

6 · Deployment

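Worker: cd cally-worker && pnpm run deploy (wrangler). Note that a bare pnpm deploy invokes pnpm's built-in deploy command rather than the package script. Otherwise this is the same manual step as today; set the two LiveKit secrets (LIVEKIT_API_KEY, LIVEKIT_API_SECRET) and the LIVEKIT_URL var first.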
Agent: build the image, run on the prod box via podman (outbound-only to LiveKit Cloud). Add a deploy/compose.cally-agent.yml-style service or a documented podman run; mirror the existing rootless-podman + restart-policy pattern. CI later; v1 is a documented manual deploy.
Frontend: normal website PR → existing GitHub Actions builds the widget (Vite emits the lazy chunk automatically) → S3/CloudFront. No build-system change.
Config: expose LIVEKIT_URL to the browser via the token response; window.CALLY_WORKER_URL already carries the API origin.

Risks / watch-items

Bundle budget: livekit-client is heavy — it MUST be a dynamic-import chunk loaded only on Voice click. Verify the base cally-widget.js stays ~83 KB gz.
Latency (standing directive): stream end-to-end — Sarvam STT streaming, Claude messages.stream, Sarvam TTS WebSocket; greet immediately; keep Haiku. Target < ~1.2 s first-audio after end-of-speech.
Secrets: Anthropic + Sarvam keys live only in the prod-box agent container and Worker secrets; they must never appear in the widget bundle (gitignored, but served to browsers) or in the repo.
Custom LLM node is the only non-off-the-shelf piece; a direct port of an already-working function (streamClaude), so risk is low.
Agent dispatch: confirm LiveKit Cloud auto-dispatches the registered worker to new rooms (vs. explicit dispatch); set agent name + room-name convention accordingly.
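
The latency target can be turned into a rough budget using the figures quoted in this plan (about 70 ms for Saaras streaming STT, sub-250 ms TTS first byte, under about 1.2 s to first audio). The sketch below only does the subtraction; the function and stage names are illustrative, and what exactly the quoted vendor figures measure should be confirmed by measurement.

```javascript
// Sketch: how much of the first-audio target is left for everything that is
// not STT or TTS (Claude's first token, end-of-speech detection, network hops).

/**
 * @param {number} targetMs                       end-of-speech to first audio
 * @param {Record<string, number>} stageBudgetsMs known per-stage costs
 */
function latencyHeadroomMs(targetMs, stageBudgetsMs) {
  const spent = Object.values(stageBudgetsMs).reduce((sum, ms) => sum + ms, 0);
  return targetMs - spent;
}

// Figures quoted in this plan: STT ~70 ms, TTS first byte < 250 ms, target 1200 ms.
console.log(latencyHeadroomMs(1200, { stt: 70, ttsFirstByte: 250 })); // 880
```

Roughly 880 ms of headroom has to cover Claude's time to first token plus turn detection and network round trips, which is why the plan keeps Haiku and streams every stage.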

Rollout (suggested order)

1. Worker /voice/token + LiveKit/Sarvam accounts & secrets (no UI yet; curl-test the token).
2. cally-agent package + custom Claude node; run locally against LiveKit Cloud with a CLI test room; verify English round-trip, then Hindi.
3. Containerize + deploy agent to prod box.
4. Frontend: voice mode inside the existing panel (lazy chunk) + enable button; wire transcription segments into the shared messages state so spoken turns stream as chat bubbles; English first, then expose the language selector.
5. Verify capture_lead by voice (spoken read-back → lead lands in Make.com as cally-voice).

Verification (end-to-end)

Token: curl -X POST $WORKER/voice/token -H 'Content-Type: application/json' -d '{"path":"/","language":"en-IN"}' → { url, token, roomName }; decode the JWT to confirm grants + metadata.
Agent local: run agent against LiveKit Cloud; speak → see STT transcript, hear Bulbul reply; switch to Hindi and repeat.
Brain parity: ask an RCM/NHCX question by voice; answer matches the text widget (same @cally/brain); guardrails hold (NHCX "in progress", never "certified").
Lead capture: give a name + phone by voice; Cally reads it back digit-by-digit, then fireLead posts to Make.com tagged cally-voice.
Frontend / unified transcript: click Voice, grant mic, hold a short conversation; verify spoken words and replies appear as streaming bubbles in the same Cally thread (interim text updating then committing), that typing and speaking interleave in one thread, and that mute, hang-up, dark mode, reduced-motion all work. Confirm the base bundle didn't grow.
Gates: pnpm --filter @cally/brain test, widget build, wrangler dry-run for the worker.
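
For the "decode JWT to confirm grants" step, a few lines of Node are enough to inspect a token returned by the curl test. This is a sketch for manual verification only: it does not verify the signature, and the helper names are hypothetical. It relies on LiveKit access tokens carrying their room grants under the video claim.

```javascript
// Sketch: inspect a LiveKit access token's payload during verification.
// Decodes WITHOUT checking the signature; for eyeballing grants only.

function decodeJwtPayload(token) {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('not a JWT: expected 3 dot-separated parts');
  }
  // JWT segments are base64url-encoded JSON.
  return JSON.parse(Buffer.from(parts[1], 'base64url').toString('utf8'));
}

/** True when the payload carries the grants the plan asks for. */
function hasVoiceGrants(payload) {
  const video = payload.video ?? {};
  return (
    typeof video.room === 'string' &&
    video.roomJoin === true &&
    video.canPublish === true &&
    video.canSubscribe === true
  );
}
```

A typical check is hasVoiceGrants(decodeJwtPayload(token)) on the token field of the curl response, followed by a look at the decoded payload for the expected identity and expiry.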

Caladrius Health AI · Cally voice plan · drafted 2026-06-20. Workflow: branch off dev in a worktree; raise PRs and leave them open (worker + agent deploys are manual, user-coordinated steps).
