# Voice Agent Platform — Backend Design

## Is This Possible with LiveKit?

**Yes.** The design in this document is **possible with LiveKit**, but LiveKit is only one part of the stack.

- **What LiveKit gives you (out of the box)**  
  - **Real-time transport**: WebRTC rooms, participants, audio/video tracks, low-latency media.  
  - **Agent dispatch**: Workers connect to the server at `/agent` with an agent token; the server assigns **jobs** (room or participant) and gives the worker a **token to join the room**.  
  - **No STT, TTS, LLM, RAG, CRM, or business logic** — LiveKit does not run your AI pipeline or integrations.

- **What you build**  
  - **Voice agent worker** (e.g. Python with `livekit-agents`): connects to LiveKit, receives job → joins room → subscribes to user audio → runs **your** STT (Deepgram/Sarvam) → **your** LLM (Sarvam/other) + RAG → **your** TTS (ElevenLabs/Sarvam) → publishes audio back to the room.  
  - **Django (or any backend)**: auth, token issuance (user + agent JWTs with same key/secret as LiveKit), storage (PostgreSQL), RAG/vector DB, webhooks, MCP/tools.  
  - **MCP servers / integrations**: CRM, tickets, scheduling, search, DB-as-KB, WhatsApp, SMS, telephony — your code or separate services that the worker or Django call.  
  - **Voice library, bot config, guardrails, etc.**: your data and APIs; LiveKit is unaware of these.

- **Summary**  
  LiveKit is the **real-time layer** (rooms + agent job dispatch). Everything else in this doc (STT, LLM, TTS, RAG, knowledge base, bot settings, MCP tools, WhatsApp, SMS, voice library) is **your application**, which uses LiveKit for real-time voice. So **yes, the design is possible with LiveKit**, provided “with LiveKit” means using it for real-time voice and agent dispatch while the rest is implemented in your backend and worker.

- **What about all tools and bot settings?**  
  **LiveKit does not provide any tools or bot settings.** It only provides rooms and agent job assignment. Everything below is **your** responsibility:
  - **Tools** — CRM (Salesforce, Zoho, HubSpot, etc.), tickets (Zendesk, Jira, Freshdesk), ServiceNow, scheduling (Cal.com, Acuity, HouseCall Pro), automation (Zapier, Make), search (Exa, Parallel), DB-as-KB, WhatsApp, SMS/telephony, and **system tools** (End conversation, Detect language, Skip turn, Transfer to agent, Transfer to number, Play keypad tone, Voicemail detection): all are implemented in your **voice agent worker** and/or **Django** (or MCP servers the worker calls). The worker gets user speech → LLM may return a tool call → worker executes the tool via your backend or MCP → result goes back to LLM or is spoken.
  - **Bot settings** — Guardrails, overrides, webhooks (conversation initiation, post-call), limits (daily/concurrent calls, retention), conversational behavior (eagerness, silence timeouts, max duration, LLM cascade), client events, privacy (no logging, retention), ambience, speech uncertainty, first message (inbound/outbound), voice/TTS (similarity, speed, stability, output format, pronunciation dictionaries, text normalisation): all are stored in **your** DB (e.g. `agent_config`, `agent_voice_config`, `agent_guardrails`) and **enforced by your worker and Django**. LiveKit has no notion of these — your worker loads config from Django and behaves accordingly.
  So: **all tools and all bot settings are possible**, but they run entirely in **your** stack (worker + Django + MCP); LiveKit only carries the real-time audio and assigns the job to your worker.

---

## 1. High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                                    FRONTEND (Web / Mobile)                                │
│  • LiveKit client SDK → connect to room with user token                                  │
│  • Publish microphone audio track                                                        │
│  • Subscribe to agent’s audio track                                                      │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                    │ user token (JWT)                    │ real-time audio
                    ▼                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                         LOCAL LIVEKIT SERVER (Go – your repo)                             │
│  • WebRTC / rooms / participants                                                        │
│  • Agent dispatch: when user joins room → create Job → assign to worker                 │
│  • /agent WebSocket: workers connect with agent token                                   │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                    │ job assignment (room + token)        │ agent joins, publishes audio
                    ▼                                      │
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                    VOICE AGENT WORKER (Python – separate process)                        │
│  • Connects to LiveKit as worker; on JobAssignment → joins room                          │
│  • Subscribes to user audio → STT (Deepgram/Sarvam) → transcript + language             │
│  • Transcript + history + RAG context → LLM (Sarvam/other) → response text              │
│  • Response text → TTS (ElevenLabs/Sarvam) → audio → publish to room                    │
│  • Optionally: POST each turn to Django (transcript, response, metadata)                 │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                    │ HTTP POST turn/events                 │ read config / RAG
                    ▼                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                         DJANGO BACKEND (Python)                                          │
│  • Auth, user/session management                                                        │
│  • Issue user + agent JWTs (same key/secret as LiveKit server)                         │
│  • REST: get token, create conversation, ingest turns, list transcripts                 │
│  • Optional: create room via local LiveKit room API                                      │
│  • Optional: webhook receiver (room_finished, etc.)                                     │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                    │                                      │
                    ▼                                      ▼
┌──────────────────────────────────┐    ┌──────────────────────────────────────────────┐
│     PostgreSQL (Primary DB)       │    │  Vector DB (e.g. pgvector / Pinecone / Qdrant) │
│  • Users, sessions                │    │  • Document chunks + embeddings                 │
│  • Conversations, turns          │    │  • Used for RAG: similarity search → chunks   │
│  • Agent config, API usage        │    │    passed to LLM as context                    │
│  • Knowledge base metadata        │    │                                                │
└──────────────────────────────────┘    └──────────────────────────────────────────────┘
```

---

## 2. Full Backend Workflow (Step by Step)

### 2.1 User starts a voice session

1. **Client** calls Django: e.g. `POST /api/sessions/start` or `GET /api/token?room=...`.
2. **Django**:
   - Authenticates user (or creates anonymous session).
   - Creates a **Conversation** row (status = active, room_name = requested or generated).
   - Optionally calls **local LiveKit server** room API to create room (if you have it).
   - Builds **user JWT** (identity, room_name, grants: publish, subscribe) with LiveKit key/secret.
   - Returns: `{ "token": "<jwt>", "room_name": "...", "conversation_id": "..." }`.
3. **Client** connects to LiveKit with token, joins room, publishes mic track.

### 2.2 Agent joins the room (triggered by LiveKit server)

4. **LiveKit server** (on room created or participant joined, depending on job type):
   - Dispatches an **agent job** (room or participant) to a registered worker.
5. **Voice agent worker** (already connected to LiveKit at `/agent` with agent token):
   - Receives **JobAssignment** (room name + token to join as agent).
   - Joins the room as a participant, subscribes to the user’s audio track.
   - Optionally notifies Django: e.g. `POST /api/conversations/<id>/agent_joined` with room_sid, job_id (for your records).

### 2.3 Per user utterance (STT → RAG → LLM → TTS)

6. **Agent worker** receives user audio (stream or segments):
   - Sends audio to **STT** (Deepgram or Sarvam).
   - Gets **transcript** + **detected_language**.
   - **Persists the transcript** in one of two ways:
     - Worker **POSTs** to Django: `POST /api/conversations/<id>/turns` with `{ "transcript": "...", "detected_language": "...", "stt_provider": "deepgram" }` → Django creates a **UserTurn** row and optionally updates **Conversation.detected_language**.
     - Or the worker writes to a queue and a Django consumer writes to the DB. Either way, the transcript ends up in **UserTurn.transcribed_text** (and related fields).
7. **RAG (knowledge base + vector DB)**:
   - Worker takes **transcript** (and maybe last N turns) and optionally **conversation_id** (to scope knowledge).
   - Builds a **query** (e.g. transcript as-is or rewritten for retrieval).
   - Calls **Django** or **vector DB directly**:
     - **Option A**: Django exposes `POST /api/rag/retrieve` with `{ "query": "...", "conversation_id": "...", "top_k": 5 }`. Django queries vector DB (e.g. pgvector), returns list of **relevant chunks** (text + source).
     - **Option B**: Worker has read-only access to vector DB and does similarity search itself; then calls Django only for turn storage.
   - Worker receives **context chunks** (texts from your knowledge base).
8. **LLM**:
   - Worker builds prompt: system prompt + “Respond in \<detected_language\>” + **RAG chunks** (as context) + conversation history + **current user transcript**.
   - Calls **LLM** (Sarvam or other). Gets **response text**.
9. **TTS**:
   - Worker sends **response text** to **TTS** (ElevenLabs/Sarvam) with language/voice.
   - Gets **audio**, publishes to room (user hears it).
10. **Agent response persistence**:
    - Worker **POSTs** to Django: e.g. `POST /api/conversations/<id>/turns/<turn_id>/response` with `{ "response_text": "...", "llm_provider": "...", "tts_provider": "...", "language": "..." }`.
    - Django creates/updates **AgentResponse** row (linked to UserTurn). Optionally stores **usage** (tokens/chars) in **APIUsage** for billing/debugging.
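
Steps 6 to 10 can be read as one function per utterance. The sketch below wires them together with injected callables so the order of operations is visible. `VoicePipeline` and all five adapters (`stt`, `retrieve`, `llm`, `tts`, `persist`) are hypothetical names, not part of any SDK, and a real worker would be asynchronous and streaming rather than a single blocking call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class VoicePipeline:
    """One-utterance pipeline; every callable is a hypothetical adapter."""
    stt: Callable[[bytes], Tuple[str, str]]   # audio -> (transcript, language)
    retrieve: Callable[[str], List[str]]      # query -> context chunks
    llm: Callable[[str], str]                 # prompt -> response text
    tts: Callable[[str, str], bytes]          # (text, language) -> audio
    persist: Callable[[dict], None]           # payload -> stored via Django
    system_prompt: str = "You are a helpful voice agent."
    history: List[Tuple[str, str]] = field(default_factory=list)

    def build_prompt(self, transcript: str, language: str,
                     chunks: List[str]) -> str:
        """System prompt + language hint + RAG chunks + history + user text."""
        context = "\n".join(f"- {chunk}" for chunk in chunks)
        past = "\n".join(f"{role}: {text}" for role, text in self.history)
        return (
            f"{self.system_prompt}\nRespond in {language}.\n"
            f"Context:\n{context}\nHistory:\n{past}\nUser: {transcript}"
        )

    def handle_utterance(self, audio: bytes) -> bytes:
        """Run STT -> persist -> RAG -> LLM -> TTS -> persist for one utterance."""
        transcript, language = self.stt(audio)
        self.persist({"kind": "user_turn", "transcript": transcript,
                      "detected_language": language})
        chunks = self.retrieve(transcript)
        response_text = self.llm(self.build_prompt(transcript, language, chunks))
        speech = self.tts(response_text, language)
        self.persist({"kind": "agent_response", "response_text": response_text,
                      "language": language})
        self.history += [("user", transcript), ("agent", response_text)]
        return speech  # the worker would publish this audio to the room
```

Keeping the providers behind plain callables makes it straightforward to swap Deepgram for Sarvam, or to test the flow with fakes.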

### 2.4 End of session

11. **User** leaves room (or closes app).
12. **LiveKit server** can send **webhook** to Django (e.g. `participant_left`, `room_finished`) if configured.
13. **Django** sets **Conversation.status** = ended, **ended_at** = now. Optionally, the worker also notifies Django on disconnect so the conversation can be marked ended even without a webhook.

### 2.5 Getting the transcript

14. **Transcript** is already stored as **UserTurn.transcribed_text** (and **AgentResponse.response_text** for agent side).
15. **APIs** (examples):
    - **Full conversation transcript**: `GET /api/conversations/<id>/transcript` → returns ordered list of turns: `[{ "role": "user", "text": "...", "language": "...", "created_at": "..." }, { "role": "agent", "text": "...", "created_at": "..." }]`.
    - **Single turn**: `GET /api/conversations/<id>/turns/<turn_id>` → user transcript + agent response for that turn.
    - **Export**: same data in JSON/CSV or formatted text for download.

So: **transcript comes from STT → stored in UserTurn (and optionally AgentResponse) → served by Django REST.**
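
As an illustration of the transcript endpoint, the sketch below interleaves user turns and agent replies into the response shape shown above. It works on plain dicts whose keys follow the `user_turn` and `agent_response` fields in this document; `build_transcript` is a hypothetical helper, and a Django view would build the same list from ORM rows.

```python
from typing import Dict, List


def build_transcript(user_turns: List[dict],
                     agent_responses: List[dict]) -> List[dict]:
    """Return turns ordered by order_index, each followed by its replies."""
    replies_by_turn: Dict[int, List[dict]] = {}
    for reply in agent_responses:
        replies_by_turn.setdefault(reply["user_turn_id"], []).append(reply)

    transcript: List[dict] = []
    for turn in sorted(user_turns, key=lambda t: t["order_index"]):
        transcript.append({
            "role": "user",
            "text": turn["transcribed_text"],
            "language": turn["detected_language"],
            "created_at": turn["created_at"],
        })
        # A turn may have zero, one, or several replies (1:1 or 1:N).
        replies = sorted(replies_by_turn.get(turn["id"], []),
                         key=lambda r: r["created_at"])
        for reply in replies:
            transcript.append({
                "role": "agent",
                "text": reply["response_text"],
                "created_at": reply["created_at"],
            })
    return transcript
```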

---

## 3. Database Architecture (PostgreSQL)

### 3.1 Entity relationship (logical)

```
┌─────────────┐       ┌──────────────────┐       ┌─────────────┐
│   User /    │       │   Conversation   │       │  UserTurn    │
│   Session   │──1:N──│                  │──1:N──│  (STT out)  │
└─────────────┘       └──────────────────┘       └──────┬───────┘
        │                           │                    │
        │                           │                    │ 1:1 or 1:N
        │                           │                    ▼
        │                           │             ┌─────────────┐
        │                           │             │AgentResponse│
        │                           │             │(LLM+TTS)    │
        │                           │             └─────────────┘
        │                           │
        │                           │             ┌─────────────┐
        │                           └─────────────│ APIUsage    │
        │                                         │(per call)   │
        │                                         └─────────────┘
        │
        │                           ┌──────────────────┐
        │                           │ AgentConfig      │
        │                           │(LLM/TTS/STT cfg) │
        │                           └──────────────────┘
        │
        │                           ┌──────────────────┐        ┌─────────────────┐
        └───────────────────────────│ KnowledgeBase    │──1:N───│ Document        │
                                    │ (metadata)       │        │ (file/metadata) │
                                    └────────┬─────────┘        └────────┬────────┘
                                             │                           │
                                             │ 1:N                       │ 1:N
                                             ▼                           ▼
                                    ┌──────────────────┐        ┌─────────────────┐
                                    │ Chunk (metadata  │        │ Embedding       │
                                    │ in PostgreSQL)   │        │ (in Vector DB)  │
                                    └──────────────────┘        └─────────────────┘
```

Vector DB holds **embeddings** and optionally chunk text; PostgreSQL holds **metadata** (which document, chunk index, KB id) so you can map vector search results back to sources.

### 3.2 Tables and fields (detailed)

#### **user / session**

- **id** (PK)
- **identifier** (e.g. email or session_id for anonymous)
- **language_preference** (optional, for default TTS/LLM)
- **created_at**, **updated_at**

Use one table or split: **User** (registered) and **Session** (anonymous); Conversation then references either user_id or session_id.

---

#### **conversation**

- **id** (PK)
- **user_id** or **session_id** (FK)
- **room_name** (LiveKit room name)
- **room_sid** (optional, from LiveKit if you store it)
- **status** (e.g. active, ended)
- **detected_language** (last or first user language in this conversation)
- **agent_job_id** (optional, from LiveKit job)
- **started_at**, **ended_at**, **created_at**, **updated_at**
- **metadata** (JSON, optional: device, channel, etc.)

---

#### **user_turn** (one per user utterance; STT output)

- **id** (PK)
- **conversation_id** (FK)
- **order_index** (int, for ordering in transcript)
- **transcribed_text** (STT output — **this is the user transcript**)
- **detected_language** (e.g. en, hi)
- **stt_provider** (deepgram, sarvam)
- **stt_confidence** (optional)
- **raw_audio_url** or **audio_storage_path** (optional)
- **created_at**

---

#### **agent_response** (one per agent reply; LLM + TTS)

- **id** (PK)
- **conversation_id** (FK)
- **user_turn_id** (FK, links to the user message this replies to)
- **response_text** (LLM output — **agent side of transcript**)
- **llm_provider**
- **tts_provider**, **voice_id** (optional)
- **tts_audio_url** or **audio_storage_path** (optional)
- **language_used** (for TTS)
- **created_at**

---

#### **agent_config**

- **id** (PK)
- **name** (e.g. default, support_hi)
- **stt_provider** (default: deepgram)
- **tts_provider** (default: elevenlabs)
- **llm_provider**, **llm_model**
- **system_prompt** (text)
- **default_language**
- **knowledge_base_id** (optional FK, which KB to use for RAG)
- **created_at**, **updated_at**

---

#### **api_usage** (logging / billing)

- **id** (PK)
- **conversation_id** (optional FK)
- **user_turn_id** or **agent_response_id** (optional)
- **provider** (deepgram, sarvam, elevenlabs, llm)
- **operation** (stt, tts, llm, embed)
- **request_id** (idempotency)
- **metadata** (JSON: chars, tokens, model, etc.)
- **created_at**

---

#### **knowledge_base** (metadata in PostgreSQL)

- **id** (PK)
- **name**, **description**
- **embedding_model** (e.g. text-embedding-3-small, or Sarvam embed model)
- **chunk_size**, **chunk_overlap** (for ingestion)
- **is_active** (boolean)
- **created_at**, **updated_at**

---

#### **document** (sources you ingest into KB: files, Google Sheets, web)

- **id** (PK)
- **knowledge_base_id** (FK)
- **source_type** (enum: file, google_sheet, web_url)
- **name**, **file_path** or **storage_url** (for files)
- **mime_type**, **file_size** (for files: pdf, docx, txt, html, md)
- **google_sheet_id**, **google_sheet_url** (for source_type = google_sheet; MCP sync)
- **web_url** (for source_type = web_url; crawled content)
- **status** (pending, processing, ready, failed)
- **last_synced_at** (for Google Sheets / web: last vector DB sync)
- **created_at**, **updated_at**

---

#### **chunk** (per-chunk metadata in PostgreSQL; actual vector in Vector DB)

- **id** (PK)
- **document_id** (FK)
- **knowledge_base_id** (FK, denormalized for easier querying)
- **chunk_index** (int, order in document)
- **text** (plain text of chunk — may also be stored in vector DB depending on design)
- **token_count** or **char_count** (optional)
- **created_at**

In the **vector DB**, store each **embedding** (vector) keyed by `chunk_id` (or another external id). PostgreSQL holds the chunk, document, and KB metadata. A similarity search in the vector DB returns chunk ids, which you then resolve to chunk text and document metadata in PostgreSQL. Alternatively, store the chunk text in the vector DB as well and use PostgreSQL only for document/KB metadata.

---

## 4. Knowledge Base and Vector DB Integration

### 4.1 Purpose

- **RAG**: User says something → you **retrieve** relevant chunks from your docs (via vector similarity) → you pass those chunks as **context** to the LLM so the agent answers using your data (e.g. product docs, FAQs, internal knowledge).

### 4.2 Flow

1. **Ingest (offline / admin)**  
   - Admin uploads a **document** (PDF, TXT, etc.) to Django; Django creates a **Document** row linked to its **KnowledgeBase**.  
   - A **worker or Django task**:  
     - Splits document into **chunks** (by size/overlap from **KnowledgeBase** config).  
     - Creates **Chunk** rows (metadata) in PostgreSQL.  
     - For each chunk: calls **embedding API** (OpenAI, Sarvam, or other) → gets **vector**.  
     - Stores **vector** in **vector DB** with id = chunk_id (or global id).  
   - Optionally store **chunk text** in vector DB if the store supports it (e.g. pgvector can store text alongside vector); otherwise chunk text stays in **chunk.text** in PostgreSQL.

2. **Retrieve (at query time, in agent worker)**  
   - User transcript (and maybe last turn) → **query**.  
   - Optionally **rewrite query** for retrieval (e.g. “What is X?” → “X definition”).  
   - **Embed** the query with same **embedding_model** as used for chunks.  
   - **Vector similarity search**: find **top_k** nearest chunks (e.g. cosine similarity).  
   - Load **chunk text** (from PostgreSQL or from vector DB).  
   - Return list of **context strings** to the worker.

3. **Use in LLM**  
   - Worker builds prompt: system prompt + “Use only this context if relevant: \<chunks\>” + “User said: \<transcript\>” + “Respond in \<language\>”.  
   - LLM generates answer; agent speaks it via TTS.
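
The retrieve and prompt steps can be sketched in plain Python. The example below ranks chunks by cosine similarity and assembles a prompt in the shape described in step 3. It assumes the query and chunk vectors were already produced by the same embedding model; the in-memory `index` dict stands in for the vector DB, and the function names are illustrative only.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector: Sequence[float],
                 index: Dict[str, Sequence[float]],
                 top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, best match first."""
    scored = [(chunk_id, cosine_similarity(query_vector, vector))
              for chunk_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def build_rag_prompt(system_prompt: str, chunks: List[str],
                     transcript: str, language: str) -> str:
    """Assemble the LLM prompt from context chunks and the user transcript."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        f"{system_prompt}\n\n"
        f"Use only this context if relevant:\n{context}\n\n"
        f"User said: {transcript}\n"
        f"Respond in {language}."
    )
```

A real vector DB performs the ranking server-side; the returned chunk ids are then resolved to chunk text before the prompt is built.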

### 4.3 Where each part runs

- **Django**:  
  - CRUD for KnowledgeBase, Document, Chunk (metadata).  
  - Trigger ingestion (e.g. Celery task that chunks, embeds, writes to vector DB).  
  - Optional: **RAG endpoint** `POST /api/rag/retrieve` that: embeds query, queries vector DB (e.g. pgvector in same PostgreSQL), returns chunk texts + source doc names.  
- **Vector DB**:  
  - Stores **embeddings** (and optionally chunk text).  
  - Supports **similarity search** (e.g. pgvector, Pinecone, Qdrant).  
- **Agent worker**:  
  - Either calls Django **/api/rag/retrieve** with query + conversation_id + top_k, **or** (if allowed) reads from vector DB and PostgreSQL itself.  
  - Injects returned chunks into LLM prompt.

### 4.4 Design choices

- **Vector DB choice**:  
  - **pgvector** (PostgreSQL extension): single DB, simpler ops; good for small/medium scale.  
  - **Pinecone / Qdrant / Weaviate**: separate service, good for large scale and advanced features.  
- **Embedding model**: Must be the same for ingest and query, whichever provider you use (Sarvam, OpenAI, etc.). Store **embedding_model** on **KnowledgeBase** so the worker knows which model to use for query embedding.  
- **Scoping**: Pass **knowledge_base_id** (or conversation → agent_config → knowledge_base_id) so RAG only searches the right KB.  
- **Conversation scope**: You can optionally filter chunks by **conversation_id** if you have per-conversation or per-tenant docs (e.g. add **conversation_id** or **tenant_id** to chunk metadata and filter in vector search or in Django).

### 4.5 Knowledge Base sources (files, Google Sheets, web)

#### Supported file uploads

- **PDF**, **Word (.docx)**, **.txt**, **HTML**, **.md** can be uploaded to a Knowledge Base.
- **Document** row: `source_type = file`, `mime_type` (e.g. application/pdf, application/vnd.openxmlformats-officedocument.wordprocessingml.document, text/plain, text/html, text/markdown), `file_path` or `storage_url`.
- **Ingestion pipeline**: Extract text (e.g. PyMuPDF/pdfplumber for PDF, python-docx for docx, raw for txt/html/md) → chunk → embed → vector DB; chunk metadata in PostgreSQL.
- **Check**: After ingest, run a **check** (e.g. sample query or validation) to confirm chunks are in vector DB and retrieval works; store status on Document.
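
The "chunk" stage of the ingestion pipeline can be sketched as a sliding window. This example assumes `chunk_size` and `chunk_overlap` are measured in characters, which the document does not specify (token-based splitting is equally plausible); `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that overlap
    by chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Each returned string would become one **Chunk** row, with its list position stored as `chunk_index`.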

#### Google Sheets as Knowledge Base (MCP server)

- **MCP server** exposes: **get** (read sheet data), **update** (edit cells), **append** (add rows), **delete** (delete rows/range).
- **Flow**: User links a Google Sheet to a Knowledge Base (e.g. provide sheet ID or URL). Django creates **Document** with `source_type = google_sheet`, `google_sheet_id` / `google_sheet_url`.
- **Sync to vector DB**: A job (cron or on-demand) calls MCP **get** to fetch current sheet content → convert to text (e.g. rows as paragraphs or structured text) → chunk → embed → **upsert** into vector DB (same chunk_id or external_id for updates). PostgreSQL **chunk** rows created/updated; **document.last_synced_at** updated.
- **Check**: After sync, **check** = verify chunk count / run a test retrieval; optional health indicator in UI.
- **Updates**: When user updates sheet via MCP (append/update/delete), trigger re-sync or incremental update so vector DB stays in sync with the sheet.
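
One way to keep the vector DB in sync without re-embedding the whole sheet is to give each row a stable external id and compare content hashes between syncs. The sketch below assumes one chunk per row and an id scheme of `<document_id>:row:<index>`; both are illustrative choices, and `plan_sheet_sync` is a hypothetical helper. Note that index-based ids shift when a row is deleted mid-sheet, which simply causes the rows after it to be re-embedded.

```python
import hashlib
from typing import Dict, List, Tuple


def row_to_text(header: List[str], row: List[str]) -> str:
    """Render one sheet row as 'column: value' pairs, skipping empty cells."""
    return "; ".join(f"{col}: {val}" for col, val in zip(header, row) if val != "")


def plan_sheet_sync(
    document_id: str,
    header: List[str],
    rows: List[List[str]],
    known_hashes: Dict[str, str],
) -> Tuple[Dict[str, str], List[str], Dict[str, str]]:
    """Compare the current sheet with hashes from the previous sync.

    Returns (to_upsert, to_delete, new_hashes):
      to_upsert  maps external_id -> text that must be embedded and upserted
      to_delete  lists external_ids whose rows no longer exist
      new_hashes is what to store for the next sync
    """
    to_upsert: Dict[str, str] = {}
    new_hashes: Dict[str, str] = {}
    for index, row in enumerate(rows):
        external_id = f"{document_id}:row:{index}"
        text = row_to_text(header, row)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[external_id] = digest
        if known_hashes.get(external_id) != digest:
            to_upsert[external_id] = text  # new or changed row
    to_delete = sorted(set(known_hashes) - set(new_hashes))
    return to_upsert, to_delete, new_hashes
```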

#### Web as Knowledge Base

- User provides **URL(s)** (e.g. https://example.com/docs). **Document** with `source_type = web_url`, `web_url`.
- **Crawl/index**: Backend (or worker) fetches URL, parses HTML (e.g. BeautifulSoup, readability), extracts main text, optionally follows same-domain links with a depth limit → produces one or more “pages” of text → chunk → embed → vector DB.
- **Check**: After crawl, same as files: verify chunks in vector DB and optional test retrieval.
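
The parse step of the crawl can be illustrated with the standard library's `html.parser` instead of BeautifulSoup. The sketch below extracts visible text and collects same-domain links for the next crawl level. It does no fetching and no main-content detection, and `PageExtractor` / `extract_page` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlparse


class PageExtractor(HTMLParser):
    """Collects visible text and absolute link targets from one HTML page."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_page(base_url: str, html: str) -> Tuple[str, List[str]]:
    """Return (page text, same-domain links) for an already fetched page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    same_domain = [link for link in parser.links if urlparse(link).netloc == host]
    return " ".join(parser.text_parts), same_domain
```

A crawler would call `extract_page` on each fetched page, enqueue the returned links until the depth limit is reached, and pass the text to the chunk and embed stages.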

---

## 5. Bot (Agent) Creation — Voice & TTS Configuration

Each **bot** (agent) has its own voice and TTS settings. These are stored per **agent_config** (or in related tables referenced by agent_config).

### 5.1 Voice selection per bot

- **Voice**: Each bot chooses one **default voice** (e.g. ElevenLabs voice_id, or Sarvam voice id). Stored e.g. **agent_voice_config.voice_id**, **agent_voice_config.tts_provider**.
- **TTS model family**: Which TTS model family to use (e.g. elevenlabs multilingual v2, sarvam hindi). **agent_voice_config.tts_model_family**.
- **Similarity**, **speed**, **stability**: TTS parameters (0–1 or provider-specific). **agent_voice_config.similarity**, **speed**, **stability**.

### 5.2 Language and voice per language

- **Default language**: One primary language for the bot (e.g. en). **agent_config.default_language**.
- **Additional languages**: Bot can support more languages; each has its own config row in **agent_language_voice** (or JSON on agent_config):
  - **language_code** (e.g. hi, ta)
  - **voice_id** (same as default or different per language)
  - **tts_provider** (optional override per language)
- So: one bot → default language + N optional languages; per language → same voice or different voice.
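
A small resolver makes the per-language rule concrete: use the language-specific row when one exists, otherwise the bot's default voice. The sketch assumes `agent_language_voice` rows have been loaded into a dict keyed by `language_code`, and that an unsupported language falls back to the default; `VoiceChoice` and `resolve_voice` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class VoiceChoice:
    language_code: str
    voice_id: str
    tts_provider: str


def resolve_voice(detected_language: Optional[str],
                  default: VoiceChoice,
                  per_language: Dict[str, dict]) -> VoiceChoice:
    """Pick the voice for a detected language, falling back to the default."""
    override = per_language.get(detected_language or "")
    if override is None:
        return default  # language not configured for this bot
    return VoiceChoice(
        language_code=detected_language,
        # Missing or empty fields inherit the bot's default values.
        voice_id=override.get("voice_id") or default.voice_id,
        tts_provider=override.get("tts_provider") or default.tts_provider,
    )
```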

### 5.3 TTS output format

- **Output format**: Select format for TTS output (e.g. **PCM 8000 Hz**, PCM 16000 Hz, mp3, etc.). **agent_voice_config.output_format** (enum or string).
- **Optimize streaming latency**: 0–4 scale (0 = quality-focused, 4 = lowest latency). **agent_voice_config.streaming_latency_optimization** (int 0–4).

### 5.4 Pronunciation dictionaries (lexicon)

- **Pronunciation dictionaries**: A lexicon of case-sensitive replacements, uploaded as a .pls (Pronunciation Lexicon Specification) file. Stored as **agent_voice_config.pronunciation_dictionary_url** or **agent_voice_config.pronunciation_dictionary_file_path**, with an optional phoneme alphabet type (IPA, or CMU for English only).
- **Add dictionary**: Upload .pls file per bot; stored in storage and path saved on **agent_voice_config** (or **agent_pronunciation_dictionary** table: agent_config_id, file_path, is_active).

### 5.5 Text normalisation (voice only)

- **Text normalisation type**: Before sending text to TTS, numbers (and similar) are converted to words. Options per provider (e.g. default, spell_out). **agent_voice_config.text_normalisation_type** (enum/string).

### 5.6 First message (inbound vs outbound)

- **First message** can differ by direction:
  - **Inbound** (user initiated, e.g. incoming call): **agent_config.first_message_inbound** (text).
  - **Outbound** (bot initiated, e.g. outbound call): **agent_config.first_message_outbound** (text).
- Worker plays the appropriate first message via TTS when the conversation starts, based on conversation direction (stored on **conversation** or passed at start).

---

## 6. Bot Settings (Guardrails, Webhooks, Limits, Behavior, Tools)

All of the following are **per-bot** (per agent_config) settings used at runtime by the voice agent worker and/or Django.

### 6.1 Guardrails

- **Purpose**: Define boundaries for what the agent can say/do; reduce risk and keep behavior predictable.
- **Implementation**: Up to **15 guardrails** (e.g. no profanity, no PII, stay on topic, no medical advice, etc.). Each can be **toggled on/off** per bot.
- **Storage**: **agent_guardrails** (agent_config_id, guardrail_key, enabled boolean) or a JSON field **agent_config.guardrail_flags** (array of keys, or dict key → bool). Worker (or LLM wrapper) enforces enabled guardrails (e.g. post-process LLM output or use a guardrail service).
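
Post-processing the LLM output is the simplest enforcement point. The sketch below applies only the guardrails enabled for a bot, using deliberately naive regexes. The guardrail keys (`no_pii`, `no_profanity`), the patterns, and the placeholder word list are all assumptions; a production setup would more likely delegate to a dedicated guardrail or PII library.

```python
import re
from typing import Callable, Dict

# Naive illustrative patterns; real PII detection needs a dedicated library.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
BANNED_RE = re.compile(r"\b(darn|heck)\b", re.IGNORECASE)  # placeholder word list


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[redacted email]", text)
    return PHONE_RE.sub("[redacted number]", text)


def mask_profanity(text: str) -> str:
    """Mask words from the banned list."""
    return BANNED_RE.sub("***", text)


# Maps guardrail_key -> function applied to the LLM output before TTS.
GUARDRAILS: Dict[str, Callable[[str], str]] = {
    "no_pii": redact_pii,
    "no_profanity": mask_profanity,
}


def apply_guardrails(text: str, flags: Dict[str, bool]) -> str:
    """Apply every registered guardrail whose flag is enabled for this bot."""
    for key, check in GUARDRAILS.items():
        if flags.get(key, False):
            text = check(text)
    return text
```

The `flags` dict corresponds to the key → bool form of **agent_config.guardrail_flags**; keys without a registered function are ignored.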

### 6.2 Overrides (client overrides at conversation start)

- **Overrides**: Which parts of the bot config can be **overridden by the client** when starting a conversation (e.g. language, first message, voice). Each override is a **toggle on/off**.
- **Storage**: **agent_config.client_override_keys** (JSON array of allowed keys, e.g. ["language", "first_message"]) or **agent_client_overrides** (agent_config_id, config_key, overridable boolean).
- **Runtime**: When client calls “start conversation”, they can send optional overrides; Django/worker only accept keys that are marked overridable.
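
The runtime check reduces to an allow-list filter. The sketch below merges client overrides into the bot config and reports which keys were refused; `apply_client_overrides` is a hypothetical helper, and whether refused keys are silently dropped or returned as an error is left to the API design.

```python
from typing import Dict, Iterable, List, Tuple


def apply_client_overrides(
    base_config: Dict[str, object],
    requested: Dict[str, object],
    allowed_keys: Iterable[str],
) -> Tuple[Dict[str, object], List[str]]:
    """Return (effective_config, rejected_keys).

    Only keys listed in allowed_keys (agent_config.client_override_keys)
    may replace values from the stored bot config.
    """
    allowed = set(allowed_keys)
    effective = dict(base_config)  # never mutate the stored config
    rejected: List[str] = []
    for key, value in requested.items():
        if key in allowed:
            effective[key] = value
        else:
            rejected.append(key)
    return effective, sorted(rejected)
```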

### 6.3 Webhooks

- **Conversation Initiation Client Data Webhook**: For Twilio/SIP trunk calls (or similar), when a conversation is initiated, backend can **fetch client data** from an external URL (e.g. CRM, caller info). **agent_config.conversation_initiation_webhook_url** (optional), **auth method** (e.g. HMAC, Bearer). Worker or Django calls this before building first message/token.
- **Post-call Webhook**: Override per agent. After call ends, **POST** to a URL (e.g. transcription, summary). **agent_config.post_call_webhook_url**, **agent_config.post_call_webhook_auth_method** (e.g. HMAC). Payload can include conversation_id, transcript, duration; Django or worker triggers this on conversation end (or on LiveKit webhook room_finished).
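
For the HMAC auth method, one common pattern is to sign a timestamp plus the raw body with a shared secret so the receiver can check both integrity and freshness. The sketch below follows that pattern with the standard library. The header names (`X-Webhook-Timestamp`, `X-Webhook-Signature`) and the `"<timestamp>.<body>"` signing format are assumptions, not something this design prescribes.

```python
import hashlib
import hmac
import json
import time
from typing import Dict, Optional, Tuple


def _mac(secret: str, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over '<timestamp>.<body>' as a hex string."""
    message = f"{timestamp}.".encode("ascii") + body
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def sign_webhook(secret: str, payload: dict,
                 timestamp: Optional[int] = None) -> Tuple[bytes, Dict[str, str]]:
    """Serialize the payload and return (body, headers) ready to POST."""
    ts = int(time.time()) if timestamp is None else timestamp
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Timestamp": str(ts),                          # hypothetical header
        "X-Webhook-Signature": f"sha256={_mac(secret, ts, body)}",  # hypothetical header
    }
    return body, headers


def verify_webhook(secret: str, body: bytes, headers: Dict[str, str],
                   max_age_seconds: int = 300,
                   now: Optional[int] = None) -> bool:
    """Receiver side: reject stale timestamps and mismatched signatures."""
    ts = int(headers["X-Webhook-Timestamp"])
    current = int(time.time()) if now is None else now
    if abs(current - ts) > max_age_seconds:
        return False
    expected = f"sha256={_mac(secret, ts, body)}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, headers["X-Webhook-Signature"])
```

The sender (Django or a Celery task) would POST `body` with `headers` to **agent_config.post_call_webhook_url**.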

### 6.4 Limits

- **Daily call limit**: Max calls per day per bot (e.g. 100000). **agent_config.daily_call_limit** (int).
- **Concurrent call limit**: Max simultaneous calls (e.g. -1 = use workspace default). **agent_config.concurrent_call_limit** (int).
- **Use subscription limit / exceed by 3x**: Toggle that allows exceeding the workspace concurrency limit by up to 3x (with extra charge). **agent_config.allow_concurrency_overflow** (boolean).
- **User input audio format**: Format expected for ASR (e.g. PCM 8000 Hz). **agent_config.user_input_audio_format** (enum/string).
- **Keywords**: Comma-separated keywords to boost ASR accuracy. **agent_config.asr_keywords** (text).
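
The daily and concurrent limits can be enforced with two counters using an increment-then-check pattern, which stays correct under concurrency when the increments are atomic (as Redis `INCR` is). In the sketch below, `InMemoryCounters` is a stand-in for Redis that exposes only `incr`/`decr`, the ISO date in the daily key is an assumption, and `workspace_concurrent_limit` is a hypothetical parameter for resolving the -1 setting.

```python
from datetime import date
from typing import Dict


class InMemoryCounters:
    """Stand-in for Redis in this sketch; exposes only incr/decr."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def decr(self, key: str) -> int:
        self._values[key] = self._values.get(key, 0) - 1
        return self._values[key]


def try_start_call(store, bot_id: int, today: date, daily_call_limit: int,
                   concurrent_call_limit: int,
                   workspace_concurrent_limit: int) -> bool:
    """Reserve a call slot; return False (and undo) if a limit is exceeded."""
    concurrent_key = f"limits:bot:{bot_id}:concurrent"
    daily_key = f"limits:bot:{bot_id}:daily:{today.isoformat()}"
    # -1 means "use workspace default" for the concurrent limit.
    max_concurrent = (workspace_concurrent_limit
                      if concurrent_call_limit == -1 else concurrent_call_limit)

    if store.incr(concurrent_key) > max_concurrent:
        store.decr(concurrent_key)
        return False
    if store.incr(daily_key) > daily_call_limit:
        store.decr(daily_key)
        store.decr(concurrent_key)
        return False
    return True


def end_call(store, bot_id: int) -> None:
    """Release the concurrent slot when the conversation ends."""
    store.decr(f"limits:bot:{bot_id}:concurrent")
```

Django would call `try_start_call` before creating the room and `end_call` when the conversation is marked ended.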

### 6.5 Conversational behavior

- **Eagerness**: How eager the agent is to respond (high = respond quickly, low = wait longer). **agent_config.eagerness** (enum: low, normal, high or 0–1).
- **Spelling patience**: More patient when user is spelling numbers/names. **agent_config.spelling_patience** (e.g. auto, on, off).
- **Take turn after silence**: Max seconds of silence after the user last spoke before the agent takes the turn and responds. -1 = wait indefinitely. **agent_config.turn_after_silence_seconds** (int, e.g. 7).
- **End conversation after silence**: Max seconds of silence after the user last spoke before the call is ended. -1 = disabled. **agent_config.end_after_silence_seconds** (int, e.g. -1).
- **Max conversation duration**: Max seconds for the whole conversation (e.g. 600). **agent_config.max_conversation_duration_seconds** (int).
- **Soft timeout**: Seconds to wait for the LLM response before returning an interim message to the user; -1 = disabled. **agent_config.soft_timeout_seconds** (int).
- **LLM cascade timeout**: Time before trying next LLM in cascade (e.g. 8 seconds). **agent_config.llm_cascade_timeout_seconds** (int).

### 6.6 Client events

- **Client events**: Which events are sent to the client (e.g. over LiveKit data channel). Options: audio, agent_response, client_tool_call, agent_chat_response_part, interruption, user_transcript, conversation_initiation_metadata, agent_response_correction, etc.
- **Storage**: **agent_config.client_events** (JSON array of event names) or **agent_client_events** (agent_config_id, event_key, enabled).

### 6.7 Privacy

- **Do not log / store**: Toggle — conversation contents not logged or stored by the platform; use post-call webhooks to get call info. **agent_config.no_logging** (boolean).
- **Conversations retention period**: Days to keep conversation data (-1 = unlimited). **agent_config.retention_days** (int).

### 6.8 Background ambience

- **Ambience type**: office, call_center, restaurant, salsa, concert, etc. **agent_config.ambience_type** (enum/string).
- **Ambience volume**: 0–1 or percentage. **agent_config.ambience_volume** (float).
- Worker mixes ambience with TTS output before publishing to room (or client-side if you send ambience as separate track).

### 6.9 Speech uncertainty

- **Clarity when speech uncertain**: Toggle (e.g. ask for clarification when confidence is low). **agent_config.speech_uncertainty_clarity_enabled** (boolean).
- **Confidence level**: Threshold below which to treat as “low confidence”. **agent_config.low_confidence_threshold** (float 0–1).
- **Low confidence message**: What to say when confidence is low (e.g. “Sorry, I didn’t catch that.”). **agent_config.low_confidence_message** (text).
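
These three settings combine into a single check on the STT confidence for each turn. A minimal sketch, assuming the STT provider reports a confidence between 0 and 1 (stored as `stt_confidence`) and that a missing confidence should not trigger the message; `low_confidence_reply` is a hypothetical helper.

```python
from typing import Optional


def low_confidence_reply(stt_confidence: Optional[float],
                         enabled: bool,
                         threshold: float,
                         message: str) -> Optional[str]:
    """Return the clarification message if the turn should be retried,
    or None if the transcript can go on to RAG and the LLM."""
    if not enabled or stt_confidence is None:
        return None
    return message if stt_confidence < threshold else None
```

When a message is returned, the worker would speak it via TTS and skip the LLM call for that turn.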

### 6.10 System tools (built-in actions)

- **System tools**: Each can be **enabled/disabled** per bot. Stored e.g. **agent_system_tools** (agent_config_id, tool_key, enabled).
- **Tools** (examples):
  - **End conversation**
  - **Detect language**
  - **Skip turn**
  - **Transfer to agent** (human handoff)
  - **Transfer to number** (e.g. SIP/phone)
  - **Play keypad touch tone**
  - **Voicemail detection**
- Worker implements these; tool_key and enabled flag determine which are active for that bot.
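
As a sketch of how the worker might apply the **agent_system_tools** rows, the helper below keeps only the handlers whose `tool_key` is enabled for the bot. The registry shape and the function name are assumptions for illustration.

```python
from typing import Any, Callable

ToolRegistry = dict[str, Callable[..., Any]]


def select_active_tools(registry: ToolRegistry, rows: list[dict]) -> ToolRegistry:
    """Keep only handlers whose tool_key is enabled for this bot.

    rows are dicts shaped like agent_system_tools records:
    {"tool_key": "end_conversation", "enabled": True}
    """
    enabled_keys = {row["tool_key"] for row in rows if row["enabled"]}
    return {key: fn for key, fn in registry.items() if key in enabled_keys}
```

Tool keys that are enabled in the database but unknown to the worker are ignored rather than raising, so a config change cannot crash a running call.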

### 6.11 Tech stack and libraries for Bot Settings and System tools

Recommended **tech stack** and **libraries** to implement bot settings and system tools reliably and maintainably.

| Area | Tech / library | Purpose |
|------|----------------|---------|
| **Backend API** | **Django**, **Django REST Framework (DRF)** | CRUD for agent_config, guardrails, overrides, webhooks, limits; validation; permissions per business_id/bot_id. |
| **Async tasks** | **Celery** (broker: Redis) | Post-call webhooks, conversation initiation webhook, retention cleanup, delayed jobs (e.g. “end after silence”). |
| **Guardrails** | **NeMo Guardrails** (NVIDIA), **Guardrails AI** (guardrails-ai), or **custom regex/PII libs** (e.g. **presidio**) | Enforce no profanity, no PII, topic bounds; post-process LLM output before TTS. |
| **Webhooks** | **httpx** (async), **requests**; **django-environ** for URLs | Call conversation-initiation and post-call webhook URLs; HMAC signing with **hmac** (stdlib) or **itsdangerous**. |
| **Limits (daily/concurrent)** | **Redis** (counters, incr/decr) | Daily call count per bot (key: `limits:bot:{id}:daily:{date}`); concurrent calls (`limits:bot:{id}:concurrent`); check before creating room. |
| **Overrides** | **Pydantic** or **DRF serializers** | Validate client override payload; allow only keys in `agent_config.client_override_keys`. |
| **Conversational behavior** | **asyncio** (timeouts, timers) in worker | Eagerness (VAD/silence detection), turn_after_silence_seconds, end_after_silence_seconds, max_conversation_duration_seconds; worker maintains per-call state and timers. |
| **Client events** | **LiveKit SDK** (data channel / room publish) | Worker publishes JSON events to room based on `agent_config.client_events`; frontend subscribes to data. |
| **System tools** | **Worker logic** + **LiveKit Room API** / **SIP provider SDK** | End conversation: worker disconnects from room. Detect language: use STT `detected_language`. Skip turn: no TTS reply. Transfer to agent/number: Twilio/Exotel etc. (MCP or **twilio**, **exotel** SDKs). Play keypad tone: emit DTMF or audio asset. Voicemail detection: telephony provider webhook or audio detection. |
| **Ambience** | **pydub**, **soundfile**, or **ffmpeg** (subprocess) | Mix pre-recorded ambience (office, call_center, etc.) with TTS output at `ambience_volume`; publish single track to room. |
| **Speech uncertainty** | **Worker logic** | Compare STT confidence to `low_confidence_threshold`; if below, play `low_confidence_message` via TTS or skip. |
| **Config loading** | **Django ORM**, **django-cacheops** or **Redis cache** | Worker fetches agent_config (+ voice_config, guardrails, system_tools) from Django; cache in Redis keyed by agent_name/bot_id to avoid DB on every turn. |
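
For the webhook row above, HMAC signing needs only the standard library. This is a minimal sketch: the function names are hypothetical, SHA-256 is an assumed choice of digest, and how the signature is transmitted (for example, which header carries it) is left to your API contract.

```python
import hashlib
import hmac


def sign_webhook_body(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_webhook_body(secret, body)
    # Compare as bytes so non-ASCII input cannot raise inside compare_digest.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The signature must be computed over the exact bytes sent on the wire; re-serializing the JSON on the receiving side can change whitespace and break verification.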

---

## 7. Database Additions for Bots and Knowledge Base Sources

### 7.1 New / extended tables (summary)

| Table / concept | Purpose |
|-----------------|--------|
| **document** | Extended with source_type (file, google_sheet, web_url), mime_type for pdf/docx/txt/html/md, google_sheet_id/url, web_url, last_synced_at. |
| **agent_voice_config** | One per agent_config: voice_id, tts_provider, tts_model_family, similarity, speed, stability, output_format, streaming_latency_optimization, text_normalisation_type, pronunciation_dictionary path/url. |
| **agent_language_voice** | Per-agent, per-language: language_code, voice_id (optional override), tts_provider override. |
| **agent_pronunciation_dictionary** | Optional: agent_config_id, file_path (e.g. .pls), is_active. |
| **agent_config** | Extended: first_message_inbound, first_message_outbound; guardrail_flags or link to agent_guardrails; client_override_keys; webhook URLs and auth; limits; conversational behavior; client_events; privacy; ambience; speech uncertainty; link to agent_system_tools. |
| **agent_guardrails** | agent_config_id, guardrail_key (e.g. no_profanity), enabled. |
| **agent_system_tools** | agent_config_id, tool_key (end_conversation, detect_language, …), enabled. |

(Other settings in 6.4–6.9 can live as columns on **agent_config** or in a single **agent_config.settings** JSON for flexibility.)

### 7.2 How this fits LiveKit

- **Bot creation** in your app = creating/editing **agent_config** (and related agent_voice_config, guardrails, tools, etc.). Each LiveKit **agent worker** is started with an **agent_name** (and optionally namespace) that maps to one **agent_config** in Django.
- Worker **fetches config** at startup or per job from Django: `GET /api/agents/<name>/config` → returns voice, TTS format, languages, first message, guardrails, limits, behavior, tools, KB id. Worker then runs STT → RAG → LLM → TTS and applies guardrails, timeouts, system tools, ambience, and client events over LiveKit.

### 7.3 Redis: queue and cache server (recommended)

Use **Redis** as the central **queue** and **cache** layer for reliability, rate limiting, and low-latency state. Recommended for production.

| Use case | How Redis is used | Library / component |
|----------|-------------------|---------------------|
| **Job queue** | Celery **broker**: async tasks (post-call webhooks, transcript persistence, retention cleanup, KB sync). | **Celery** with **redis** as broker (`CELERY_BROKER_URL=redis://...`). |
| **Result backend** | Store Celery task results so Django/worker can poll or get callback. | **Celery** result backend `CELERY_RESULT_BACKEND=redis://...`. |
| **Rate limiting** | Enforce **daily call limit** and **concurrent call limit** per bot. Keys: `limits:bot:{bot_id}:daily:{date}` (INCR, EXPIRE 24h), `limits:bot:{bot_id}:concurrent` (INCR on join, DECR on leave). | **redis-py** (e.g. `redis.incr`, `redis.decr`), or **django-ratelimit** (Redis backend). |
| **Caching** | Cache **agent_config** (and voice_config, guardrails, system_tools) by bot_id/agent_name to avoid DB hit on every turn. TTL e.g. 60–300s; invalidate on config update. | **django-redis** or **redis-py**; Django cache backend `CACHES = { "default": { "BACKEND": "django_redis.cache.RedisCache", "LOCATION": "redis://..." } }`. |
| **Session / call state** | Per-conversation state (e.g. last speech time, turn count) for **conversational behavior** (turn_after_silence, end_after_silence). Worker reads/writes by conversation_id. | **redis-py** (hashes or JSON); key `conv:{conversation_id}:state`. |
| **Distributed locks** | Ensure only one worker handles a given conversation or job (e.g. post-call webhook once per conversation). | **redis-py** with SET NX EX or **celery-once**, **redlock**. |
| **Pub/Sub** (optional) | Notify workers or Django when config changes, or broadcast “end call” from another service. | **redis-py** pub/sub or **channels** (Django Channels) with Redis channel layer. |

**Why Redis for the queue**: It is fast and in-memory, supports queues (list push/pop), TTLs, and atomic counters, and is a widely used Celery broker. It avoids overloading PostgreSQL with high-frequency updates (e.g. concurrent call count) and keeps async work off the request path.

**Deployment**: Run **Redis server** (single node or Redis Cluster for HA). Django and the voice agent worker connect via `REDIS_URL` (or host/port). Use the same Redis instance for broker, cache, and rate-limit counters, or separate instances per concern if needed.
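
The daily and concurrent limit counters from the table above could be checked as follows. This is a sketch, not a definitive implementation: `try_start_call` and `end_call` are hypothetical helpers, and `InMemoryCounters` is a stand-in that exposes only the `incr`, `decr`, and `expire` methods the sketch uses, so the example runs without a Redis server. In production a redis-py client would be passed in instead.

```python
import datetime
from typing import Optional


class InMemoryCounters:
    """Stand-in for a Redis client (incr/decr/expire only), for illustration."""

    def __init__(self) -> None:
        self.values: dict[str, int] = {}
        self.ttls: dict[str, int] = {}

    def incr(self, key: str) -> int:
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def decr(self, key: str) -> int:
        self.values[key] = self.values.get(key, 0) - 1
        return self.values[key]

    def expire(self, key: str, seconds: int) -> bool:
        self.ttls[key] = seconds
        return True


def try_start_call(client, bot_id: str, daily_limit: int, concurrent_limit: int,
                   today: Optional[datetime.date] = None) -> bool:
    """Reserve a call slot; return False if either limit would be exceeded."""
    day = today or datetime.date.today()
    daily_key = f"limits:bot:{bot_id}:daily:{day.isoformat()}"
    concurrent_key = f"limits:bot:{bot_id}:concurrent"

    # Increment first, then roll back on failure, so the check is atomic.
    if client.incr(concurrent_key) > concurrent_limit:
        client.decr(concurrent_key)
        return False
    count = client.incr(daily_key)
    if count == 1:
        client.expire(daily_key, 24 * 60 * 60)  # first call of the day sets TTL
    if count > daily_limit:
        client.decr(daily_key)
        client.decr(concurrent_key)
        return False
    return True


def end_call(client, bot_id: str) -> None:
    """Release the concurrent slot when the call leaves the room."""
    client.decr(f"limits:bot:{bot_id}:concurrent")
```

Incrementing before comparing avoids the race in a read-then-write check, because each `INCR` is atomic in Redis. `end_call` must run on every exit path (including worker crashes, via a cleanup job) or the concurrent counter will drift upward.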

---

## 8. MCP Servers — Overview and Credential Model

### 8.1 Why MCP and per-tenant credentials

- **MCP (Model Context Protocol)** servers expose **tools** and **resources** (e.g. CRM, tickets, search, DB) so the voice agent can call external APIs during a conversation.
- **Multi-tenant**: Each **client** (business) uses **their own** tokens, API keys, and secrets. Credentials are **scoped by business_id** (and optionally by **bot_id** so different bots can use different integrations).
- **Storage**: **business_integration** (or **mcp_connection**) table: **business_id**, **integration_type** (e.g. salesforce, zendesk, cal_com), **credentials** (encrypted: API key, secret, token, instance URL, etc.), **enabled**, **scoped_to_bot_ids** (optional JSON array; null = all bots under business). Worker or Django resolves **business_id** from the conversation (e.g. conversation → user → business or conversation.metadata.business_id) and **bot_id** from agent_config, then loads the right credentials to call the MCP server.
- **Runtime**: When the agent needs to run a tool (e.g. "Create Lead in Salesforce"), the worker calls the **MCP server** (or Django proxy that calls the provider API) with **business_id** and **bot_id**; backend uses the stored credentials for that business (and bot if scoped). No cross-tenant leakage.
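
The scoping rule (business first, then optional bot scope) can be sketched as a lookup over **business_integration** rows. The dataclass below mirrors the columns named above; `credentials_ref` is a hypothetical placeholder for the encrypted credentials, and `resolve_integration` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessIntegration:
    business_id: str
    integration_type: str
    enabled: bool
    scoped_to_bot_ids: Optional[list[str]]  # None = all bots under the business
    credentials_ref: str  # hypothetical pointer to the encrypted credentials


def resolve_integration(
    rows: list[BusinessIntegration],
    business_id: str,
    bot_id: str,
    integration_type: str,
) -> Optional[BusinessIntegration]:
    """Return the enabled integration this bot may use, or None."""
    for row in rows:
        if row.business_id != business_id:
            continue  # never consider another tenant's credentials
        if row.integration_type != integration_type or not row.enabled:
            continue
        if row.scoped_to_bot_ids is None or bot_id in row.scoped_to_bot_ids:
            return row
    return None
```

Filtering on `business_id` before anything else is what prevents cross-tenant leakage; in Django the same conditions would normally be expressed as a queryset filter rather than a Python loop.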

### 8.2 Tool execution flow

- LLM (with function/tool-calling) decides to invoke a tool (e.g. `salesforce_create_lead`).
- Worker looks up **business_id** and **bot_id** for the current conversation.
- Worker calls **Django** `POST /api/tools/execute` with `{ "tool_key": "salesforce_create_lead", "params": {...}, "business_id": "...", "bot_id": "..." }` **or** worker calls an **MCP server** that has already been configured with that business's credentials (e.g. env or config per request).
- Django (or MCP server) uses **business_integration** credentials for that integration type and business → calls Salesforce (or Zendesk, etc.) API.
- Result returned to worker → passed back to LLM or spoken via TTS.
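
On the Django side, the body of `POST /api/tools/execute` might be dispatched roughly as below. This is a sketch under assumptions: the handler signature and the `{"ok": ..., "result": ...}` response shape are hypothetical, and credential lookup happens inside each handler.

```python
from typing import Any, Callable

# Hypothetical handler signature: (params, business_id, bot_id) -> result dict.
ToolHandler = Callable[[dict[str, Any], str, str], dict[str, Any]]

REQUIRED_FIELDS = ("tool_key", "params", "business_id", "bot_id")


def execute_tool(
    request: dict[str, Any], handlers: dict[str, ToolHandler]
) -> dict[str, Any]:
    """Validate the request body and route it to the matching tool handler."""
    missing = [field for field in REQUIRED_FIELDS if field not in request]
    if missing:
        return {"ok": False, "error": "missing fields: " + ", ".join(missing)}
    handler = handlers.get(request["tool_key"])
    if handler is None:
        return {"ok": False, "error": "unknown tool: " + request["tool_key"]}
    result = handler(request["params"], request["business_id"], request["bot_id"])
    return {"ok": True, "result": result}
```

Returning a structured error instead of raising lets the worker pass the failure back to the LLM, which can then apologize or retry rather than ending the call.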

---

## 9. MCP Server Catalog (CRM, Tickets, Scheduling, Automation, Search, DB, WhatsApp, SMS/Telephony)

### 9.1 CRM MCP servers

**Providers**: Zoho, Salesforce, LSQ, Sell.do, HubSpot.

**Purpose**: Let the voice agent look up and update leads, contacts, accounts, opportunities, tasks, events, cases (and run query/search) during the call. Each provider has its own MCP server or adapter; tools are normalized per provider.

**Example tools (Salesforce-style; map to equivalent in Zoho, HubSpot, etc.)**:

| Tool | Description |
|------|-------------|
| Get Lead | Fetch lead by ID |
| Create Lead | Create a new lead |
| Update Lead | Update existing lead |
| Get Contact | Fetch contact by ID |
| Create Contact | Create contact |
| Update Contact | Update contact |
| Get Account | Fetch account by ID |
| Create Account | Create account |
| Update Account | Update account |
| Get Opportunity | Fetch opportunity |
| Create Opportunity | Create opportunity |
| Update Opportunity | Update opportunity |
| Create Task | Create task |
| Get Task | Get task |
| Update Task | Update task |
| Create Event | Create calendar event |
| Get Event | Get event |
| Update Event | Update event |
| Create Case | Create support case |
| Get Case | Get case |
| Update Case | Update case |
| Query | Run SOQL/query (Salesforce) or equivalent |
| Search | Global search |

**Implementation**: One MCP server (or Django app) per CRM provider; each exposes the above tools. Credentials (OAuth tokens, API keys, instance URL) stored per **business_id** in **business_integration**. Agent config can enable which CRM tools are available for that bot.

---

### 9.2 Ticket / support MCP servers

**Providers**: Zendesk, Jira, Freshdesk.

**Purpose**: Create, update, and manage support tickets; search tickets; add comments; manage users and orgs; trigger the AI agent when new ticket comments are added (webhook to start or continue conversation).

**Example tools (Zendesk-style; similar for Jira/Freshdesk)**:

- Create Ticket, List Tickets, Show Ticket, Update Ticket, Show Many Tickets, Create Many Tickets, Update Many Tickets, Delete Ticket, Delete Many Tickets
- List Ticket Comments, Merge Tickets, List Problems, Add Tags, Remove Tags, Add Comment, Search
- Get User, List Users, Search Users, Search User, Create User, Update User, Delete User, Show Many Users, Create Or Update User, Create Or Update Many Users, Update Many Users
- Get User Requested Tickets, Get User Assigned Tickets, Get User Ccd Tickets
- Get Organization, Get Organization Tickets
- Log Call, Check Availability

**Triggers**: Configure a webhook in Zendesk (or Jira/Freshdesk) that calls your backend when a new comment is added. The backend can then start a LiveKit conversation or inject context into an existing one, so the same AI agent can respond in the ticket or via voice.

**Implementation**: MCP server (or Django + Celery) implements each tool using provider API; credentials per **business_id**. Optional **trigger_webhook** endpoint in Django that Zendesk/Jira/Freshdesk call.

---

### 9.3 ServiceNow MCP server

**Purpose**: Automate IT service management: incidents, cases, knowledge base, tasks, assignment groups.

**Example tools**:

- Get Incident, Create Incident, Update Incident, Query Incidents
- Get User, Search User By Email, Search User By Phone
- Create Case, Get Case, Update Case, Query Cases
- Search Knowledge, Get Knowledge Article
- Create Task, Get Tasks, Update Task
- Get Assignment Groups

**Implementation**: One MCP server (or adapter) for ServiceNow; credentials (instance URL, username, password or OAuth) per **business_id**. Used during call for RAG (Search Knowledge, Get Knowledge Article) and for creating/updating tickets and tasks.

---

### 9.4 Scheduling MCP servers

**Providers**: Cal.com, Acuity, HouseCall Pro.

**Purpose**: Let the agent check availability, create/cancel bookings during or after the call.

**Example tools (Cal.com-style)**:

- Calcom Create Booking
- Calcom Get Available Slots
- Calcom Get All Bookings
- Calcom Get Booking
- Calcom Cancel Booking

(Equivalent tools for Acuity and HouseCall Pro: get availability, create appointment, get/cancel booking.)

**Implementation**: One MCP server per scheduling provider; credentials (API key, account ID) per **business_id**. Agent can say "I've booked you for 3pm tomorrow" after calling Create Booking.

---

### 9.5 Automation MCP servers (Zapier, Make)

**Purpose**: Invoke automation workflows (Zapier Zaps, Make scenarios) as **tools** from the voice agent. E.g. "When user says 'send this to my team', trigger Zapier webhook."

**Implementation**: MCP server exposes tools like **zapier_trigger** (webhook URL + payload) or **make_run_scenario** (scenario ID, input). Credentials (webhook URLs, API keys) stored per **business_id**. No need to list all Zapier/Make actions; just "invoke this webhook" or "run this scenario" as a tool.

---

### 9.6 Search MCP servers (Exa, Parallel)

**Purpose**: Semantic / AI-oriented search. Exa (and Parallel) provide **search-only** tools so the agent can find up-to-date web content, research topics, and cite sources during the call.

**Tools**: Single tool or a few (e.g. **exa_search** with query, num_results, optional filters). Returns titles, snippets, URLs. Agent uses results as context for LLM or speaks a summary.

**Implementation**: MCP server calls Exa/Parallel API; API key stored per **business_id**. Used only for search (no CRM/ticket side effects).

---

### 9.7 Database as Knowledge Base (MCP server)

**Purpose**: Use a **whole database** (or selected tables/views) as a knowledge base: export schema + data to text, chunk, embed, upsert into vector DB, then use during call like any other KB. Useful for product catalogs, internal DBs, etc.

**Flow**:

1. **Admin** connects a database via MCP (or Django UI): provide connection string (or host, port, db, user, password) scoped to **business_id**. Optionally select tables/views or "full DB".
2. **Backend** (scheduled or on-demand job): Introspect schema; for each table/view, export rows (or sample) to text (e.g. JSON rows or "Table X: row1, row2..."). Chunk the text (by size/overlap). Embed chunks; upsert into vector DB with metadata (e.g. table name, row id). Store chunk metadata in PostgreSQL under a **Knowledge Base** of type `database` linked to **document** (source_type = database, connection_ref or snapshot_path).
3. **During call**: Same RAG flow: the user's utterance becomes a query, vector search runs over the DB-derived chunks, and the retrieved context is passed to the LLM.
4. **Sync**: Re-run export + chunk + embed periodically or on "refresh" so vector DB stays in sync with DB.

**MCP server**: Exposes **list_tables**, **sync_database_to_kb** (trigger job), **get_sync_status**. Credentials (DB connection) stored per **business_id**; only that business's DB is used for their bots.
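
The export and chunk steps of the flow above can be sketched in plain Python. Both helpers are hypothetical, the "Table X: ..." line format follows the example in step 2, and chunking is by character count with overlap; embedding and upserting are omitted because they depend on the chosen provider and vector DB.

```python
import json
from typing import Any


def rows_to_text(table: str, rows: list[dict[str, Any]]) -> str:
    """Render each row as one 'Table <name>: <json>' line."""
    return "\n".join(
        # default=str keeps dates, decimals, etc. from breaking serialization.
        f"Table {table}: " + json.dumps(row, sort_keys=True, default=str)
        for row in rows
    )


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

Character-based chunking is the simplest option; splitting on row boundaries would keep each record intact and is often a better fit for tabular data.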

---

### 9.8 WhatsApp MCP server (templates, during/after call, AI chat)

**Providers**: WhatsApp Business Cloud (Meta), Twilio WhatsApp Business.

**Purpose**:

- **Triggers**: Send WhatsApp messages **during** or **after** the call using **templates** (image, message, video, button, etc.). E.g. "Send follow-up with brochure after call" or "Send OTP during call".
- **AI chat**: Same voice AI bot can continue the conversation on **WhatsApp** (text chat). User messages WhatsApp; your backend runs the same LLM/agent and responds in text. Optionally use TTS for outbound voice and WhatsApp for text in parallel.

**Tools / capabilities**:

- **Send template** (template name, params, recipient, optional "after_call" / "during_call" timing).
- **Receive inbound** WhatsApp message and run through same agent (LLM) and reply via WhatsApp.
- Templates: image, video, text, button (quick reply, call, URL). Stored per **business_id** (and approved in Meta Business Manager).

**Implementation**: MCP server (or Django + webhook receiver) for WhatsApp Business API. Credentials (phone number ID, access token, Twilio SID/token if using Twilio) per **business_id**. Webhook URL for inbound messages; when call ends (or at a trigger point during call), worker or Django calls "send template" with the chosen template. For "same agent on WhatsApp", route inbound WhatsApp to the same agent logic (LLM + tools) and post reply back via WhatsApp API.

---

### 9.9 SMS and telephony MCP servers

**Providers**: Twilio, Exotel, Plivo, Amazon Connect, Tata Smartflo.

**Purpose**: **Telephony** (inbound/outbound voice calls) and **SMS** (send/receive). Each client chooses **provider** and can customize behavior (e.g. caller ID, SMS templates, IVR).

**Capabilities**:

- **Telephony**: Route inbound calls to LiveKit (e.g. Twilio calls your backend, which creates a room, issues a token, and connects the PSTN leg to the room). Outbound: agent or API triggers "call this number" using the chosen provider.
- **SMS**: Send SMS (e.g. post-call summary, OTP, link); receive SMS and optionally route to same LLM agent (like WhatsApp).
- **Customizable**: Per **business_id** (or per **bot_id**): choose provider, set credentials (account SID, auth token, etc.), set defaults (from number, SMS template, etc.).

**Implementation**: One MCP server (or Django app) per provider; **business_integration** stores provider type and credentials. Optional **unified telephony/SMS API** in Django that selects provider by business_id and proxies requests.

---

## 10. Voice Library and Voice Cloning

### 10.1 Voice library (browse, listen)

- **Purpose**: Central place to **see all voices** available for TTS (ElevenLabs, Sarvam, custom clones). Admin can **listen to samples** and assign a voice to a bot.
- **Storage**: **voice_library** (or **voice**): **id**, **provider** (elevenlabs, sarvam, custom_clone), **external_voice_id** (provider's ID), **name**, **language**, **sample_audio_url** (optional), **metadata** (JSON: gender, style, etc.), **business_id** (optional; null = platform-wide), **created_at**.
- **APIs**: `GET /api/voices` (list; filter by provider, language, business_id). `GET /api/voices/<id>/sample` (stream or redirect to sample_audio_url). Used by bot creation UI to pick default voice and per-language voice.

### 10.2 Voice cloning (upload or prompt)

- **Clone by upload**: User uploads **audio** (e.g. 5 minutes to 1 hour). Backend sends to provider (e.g. ElevenLabs voice clone API) and gets a new **voice_id**; save in **voice_library** with **source** = upload, **source_audio_url** (or file path). That voice can then be assigned to a bot.
- **Clone by prompt**: User provides a **text description** (e.g. "calm male, British accent"). Backend calls provider's "generate voice from description" (if supported) and gets a new **voice_id**; save in **voice_library** with **source** = prompt, **prompt_text**.
- **Scoping**: Cloned voices can be **business-scoped** (business_id set) so only that tenant can use them. Platform voices (business_id null) are available to all.

### 10.3 Database additions for voice library

- **voice_library**: id, business_id (nullable), provider, external_voice_id, name, language, sample_audio_url, source (platform, upload, prompt), source_audio_url or prompt_text (nullable), metadata (JSON), created_at, updated_at.

---

## 11. Transcript: End-to-End

### 11.1 Where transcript comes from

- **User side**: STT (Deepgram/Sarvam) turns **user audio** into **text** → that text is the **user transcript**.
- **Agent side**: **LLM output** (before TTS) is the **agent transcript** (what the agent “said”).

### 11.2 Where it’s stored

- **User transcript**: **user_turn.transcribed_text** (and **detected_language**, **stt_provider**, **created_at**).
- **Agent transcript**: **agent_response.response_text** (and **created_at**, **language_used**).

### 11.3 How you “get” the transcript

- **Real-time** (during call): Worker can optionally send transcript events to the frontend via LiveKit **data messages**; otherwise the frontend polls for each turn after it has been stored.
- **After the fact**:  
  - **GET /api/conversations/<id>/transcript**  
    - Returns ordered list of turns: for each **user_turn**, one object with role=user, text=transcribed_text, language, created_at; for each **agent_response** linked to that turn, one object with role=agent, text=response_text, created_at.  
  - **GET /api/conversations/<id>/turns**  
    - Same data in a turn-centric shape (each turn = user message + list of agent responses).  
  - **Export**: Same data as JSON/CSV or plain text (e.g. “User: … Agent: …”) for download.
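
The transcript endpoint's ordering logic can be sketched over plain dicts. Assumptions in this sketch: each **agent_response** row carries a `user_turn_id` linking it to its turn (the link column is not named in this document), `created_at` values are ISO 8601 strings that sort chronologically, and both function names are hypothetical.

```python
from typing import Any


def build_transcript(
    user_turns: list[dict[str, Any]], agent_responses: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Interleave user turns with their agent responses in time order."""
    by_turn: dict[Any, list[dict[str, Any]]] = {}
    for response in sorted(agent_responses, key=lambda r: r["created_at"]):
        by_turn.setdefault(response["user_turn_id"], []).append(response)

    items: list[dict[str, Any]] = []
    for turn in sorted(user_turns, key=lambda t: t["created_at"]):
        items.append({
            "role": "user",
            "text": turn["transcribed_text"],
            "language": turn["detected_language"],
            "created_at": turn["created_at"],
        })
        for response in by_turn.get(turn["id"], []):
            items.append({
                "role": "agent",
                "text": response["response_text"],
                "created_at": response["created_at"],
            })
    return items


def to_plain_text(items: list[dict[str, Any]]) -> str:
    """Render the transcript as 'User: ...' / 'Agent: ...' lines for export."""
    labels = {"user": "User", "agent": "Agent"}
    return "\n".join(f"{labels[item['role']]}: {item['text']}" for item in items)
```

The turn-centric `/turns` shape can be derived from the same grouping by emitting one object per user turn with its list of responses instead of flattening.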

### 11.4 Optional: live transcript in UI

- Worker, after each STT result, can send a **data message** over LiveKit (e.g. `{ "type": "transcript", "role": "user", "text": "..." }`).  
- When agent response is ready, send `{ "type": "transcript", "role": "agent", "text": "..." }`.  
- Frontend subscribes to data channel and updates on-screen transcript in real time.  
- You still persist the same data in Django for history and export.

---

## 12. Summary Checklist

| Topic | What you have |
|-------|----------------|
| **Backend workflow** | Django issues tokens and stores data; LiveKit (local) does real-time; worker does STT → RAG → LLM → TTS and POSTs turns to Django. |
| **Database (PostgreSQL)** | User/Session, Conversation, UserTurn, AgentResponse, AgentConfig, APIUsage, KnowledgeBase, Document, Chunk (metadata). |
| **Queue & cache (Redis)** | Celery broker and result backend; rate limiting (daily/concurrent call limits); cache for agent config; per-call state; distributed locks. |
| **Tech stack (bot settings & system tools)** | Django + DRF; Celery; guardrails (NeMo / Guardrails AI / presidio); webhooks (httpx, HMAC); Redis for limits; asyncio in worker; LiveKit data channel for client events; pydub/soundfile for ambience. |
| **Vector DB** | Stores chunk embeddings (and optionally text); used for RAG retrieval; can be pgvector or external; chunk metadata in PostgreSQL. |
| **Knowledge base** | Documents → chunks → embed → vector DB; at query time: embed transcript → similarity search → chunks → LLM context. |
| **Transcript** | User = STT → user_turn.transcribed_text; agent = LLM → agent_response.response_text; retrieved via conversation transcript API and optional export. |

This covers the full backend workflow, database architecture, knowledge base + vector DB integration, and transcript flow for the platform we’re building.
