# Unified Transcription Service Guide

## Overview

The Unified Transcription Service provides intelligent language detection, automatic STT provider selection, quality validation, and retry logic for call transcriptions.

## Features

### 1. **Automatic Language Detection**
- Detects Hindi, English, Kannada, Tamil, Telugu, Malayalam, Bengali, Gujarati
- Routes to appropriate STT provider based on detected language
  - **Deepgram**: Hindi, English, and mixed Hindi-English calls
  - **Sarvam**: Kannada, Tamil, Telugu, and the other supported Indian languages (Malayalam, Bengali, Gujarati)

### 2. **Quality Validation**
The system validates transcripts for:
- Empty or very short transcripts
- Missing speaker segments
- Duration mismatches (transcribed duration much shorter than the actual call duration)
- Only one speaker detected (unusual for phone calls)
- Repetitive or nonsensical text
- Too few words relative to duration

### 3. **Automatic Retry Logic**
- Up to 3 attempts in total (the initial attempt plus up to 2 retries)
- Automatically switches between the Deepgram and Sarvam providers on each retry
- Keeps retrying until quality validation passes or all attempts are exhausted

## Quality Issues Detection

### Example: Bad Transcript
**Problem**: Call ID `88615207351765699523`
- **Actual call duration**: 99 seconds (13:35:23 to 13:37:02)
- **Transcribed duration**: 35.45 seconds
- **Issue**: Missing 64% of the call audio

The service detects this mismatch automatically (35.45 s is about 36% of 99 s, below the 50% duration threshold) and retries, alternating providers:
1. First attempt: Sarvam (original)
2. Second attempt: Deepgram (retry)
3. Third attempt: Sarvam (retry)
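
The figures in this example can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `coverage_ratio` is a hypothetical helper name, not a function of the service.

```python
from datetime import datetime

def coverage_ratio(start: str, end: str, transcribed_seconds: float) -> float:
    """Return the transcribed duration as a fraction of the wall-clock call duration."""
    fmt = "%H:%M:%S"
    actual = (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    return transcribed_seconds / actual

ratio = coverage_ratio("13:35:23", "13:37:02", 35.45)
print(f"covered {ratio:.0%}, missing {1 - ratio:.0%}")
# covered 36%, missing 64%
```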

## Usage

### Process a Single Call

```bash
cd /var/www/html/tatsat2/dashboard-backend

# Process specific call
python3 process_calls_unified.py --bid 7987 --call-id 88615207351765699523

# Reprocess a failed call
python3 process_calls_unified.py --bid 7987 --call-id 88615207351765699523 --reprocess
```

### Process All Pending Calls

```bash
# Process all pending calls for a business
python3 process_calls_unified.py --bid 7987

# Process only first 10 pending calls
python3 process_calls_unified.py --bid 7987 --limit 10
```

### Test Direct Transcription

```bash
# Test with audio URL
python3 unified_transcription_service.py "https://recordings.mcube.com/path/to/audio.wav"

# Test with expected duration for validation
python3 unified_transcription_service.py "https://recordings.mcube.com/path/to/audio.wav" 99
```

## Configuration

Required environment variables in `.env`:

```bash
# Deepgram (for Hindi-English)
DEEPGRAM_API_KEY=your_deepgram_key

# Sarvam (for Kannada and other Indian languages)
SARVAM_SUBSCRIPTION_KEY=your_sarvam_key

# Anthropic Claude (for transcript cleanup and translation)
ANTHROPIC_API_KEY=your_anthropic_key

# Database
DB_HOST=10.0.0.109
DB_PORT=3306
DB_USER=admin
DB_PASSWORD=your_db_password
DB_NAME=voicebot_cluster
```
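
A quick pre-flight check can catch missing settings before any call is processed. This is a minimal sketch, assuming the variables have already been loaded into the process environment (the guide does not say how `.env` is read); `missing_env_vars` is a hypothetical helper name.

```python
import os

# Names taken from the .env listing above.
REQUIRED_VARS = [
    "DEEPGRAM_API_KEY",
    "SARVAM_SUBSCRIPTION_KEY",
    "ANTHROPIC_API_KEY",
    "DB_HOST",
    "DB_PORT",
    "DB_USER",
    "DB_PASSWORD",
    "DB_NAME",
]

def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing configuration:", ", ".join(missing))
```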

## Output Format

The service returns structured data:

```python
{
    "transcript": "Cleaned and translated transcript",
    "raw_transcript": "Original transcript",
    "speaker_segments": [
        {
            "speaker": "Speaker 1",
            "text": "Hello",
            "start_time": 0.0,
            "end_time": 1.5
        },
        # ...
    ],
    "duration": 99.0,
    "num_speakers": 2,
    "stt_provider": "deepgram",  # or "sarvam"
    "language_detected": "hi-en",  # or "kn", "ta", etc.
    "processed_at": "2025-12-22T...",
    "quality_validation": {
        "is_valid": True,
        "attempts": 2,
        "providers_tried": ["sarvam", "deepgram"],
        "issues": []
    }
}
```
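
As an illustration of how a caller might consume this structure, the sketch below summarizes the validation outcome and totals talk time per speaker. Both helper names are hypothetical; only the dictionary keys come from the format above.

```python
def summarize_result(result: dict) -> str:
    """One-line summary of provider, language, and validation outcome."""
    qv = result.get("quality_validation", {})
    status = "valid" if qv.get("is_valid") else "invalid"
    providers = " -> ".join(qv.get("providers_tried", []))
    return (
        f"{result.get('stt_provider')} / {result.get('language_detected')}: "
        f"{status} after {qv.get('attempts', 0)} attempt(s) [{providers}]"
    )

def speaker_talk_time(result: dict) -> dict:
    """Total seconds spoken per speaker, computed from speaker_segments."""
    totals = {}
    for seg in result.get("speaker_segments", []):
        spoken = seg["end_time"] - seg["start_time"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + spoken
    return totals

sample = {
    "speaker_segments": [
        {"speaker": "Speaker 1", "text": "Hello", "start_time": 0.0, "end_time": 1.5},
    ],
    "stt_provider": "deepgram",
    "language_detected": "hi-en",
    "quality_validation": {
        "is_valid": True,
        "attempts": 2,
        "providers_tried": ["sarvam", "deepgram"],
        "issues": [],
    },
}
print(summarize_result(sample))
print(speaker_talk_time(sample))
```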

## Database Schema

Results are stored in the `{bid}_sarvamresponse` table, where `{bid}` is the business ID passed with `--bid`:

```sql
CREATE TABLE IF NOT EXISTS `{bid}_sarvamresponse` (
    id INT PRIMARY KEY AUTO_INCREMENT,
    callid VARCHAR(50) NOT NULL,
    transcript TEXT,
    raw_transcript TEXT,
    speaker_segments JSON,
    num_speakers INT,
    duration FLOAT,
    request_id VARCHAR(255),
    language VARCHAR(10),
    stt_provider VARCHAR(50),  -- 'deepgram' or 'sarvam'
    language_detected VARCHAR(10),
    raw_response TEXT,
    status INT DEFAULT 0,
    created_at DATETIME,
    updated_at DATETIME,
    INDEX idx_callid (callid),
    INDEX idx_stt_provider (stt_provider),
    INDEX idx_language (language_detected)
);
```
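
Because the table name depends on the BID, it has to be interpolated into the statement, while the values can still be bound as parameters. The sketch below shows one way to build such an insert. It assumes a MySQL driver that uses `%s` placeholders (for example PyMySQL), and `build_insert` is a hypothetical helper; columns that do not appear in the output format (`request_id`, `language`, `raw_response`, `status`, timestamps) are left out.

```python
import json

COLUMNS = (
    "callid", "transcript", "raw_transcript", "speaker_segments",
    "num_speakers", "duration", "stt_provider", "language_detected",
)

def build_insert(bid, callid, result):
    """Return (sql, params) for a parameterized INSERT into `{bid}_sarvamresponse`."""
    # Table names cannot be bound as query parameters, so restrict the BID
    # to digits before interpolating it into the statement.
    if not str(bid).isdigit():
        raise ValueError(f"bid must be numeric, got {bid!r}")
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = (
        f"INSERT INTO `{int(bid)}_sarvamresponse` "
        f"({', '.join(COLUMNS)}) VALUES ({placeholders})"
    )
    params = (
        str(callid),
        result["transcript"],
        result["raw_transcript"],
        json.dumps(result["speaker_segments"]),  # stored in the JSON column
        result["num_speakers"],
        result["duration"],
        result["stt_provider"],
        result["language_detected"],
    )
    return sql, params
```

The returned pair can then be passed to a DB-API cursor as `cursor.execute(sql, params)`.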

## Language Detection Logic

### Supported Languages

| Language | Code | STT Provider | Notes |
|----------|------|--------------|-------|
| Hindi | `hi` | Deepgram | Good accuracy |
| Hindi-English | `hi-en` | Deepgram | Code-switching supported |
| English | `en` | Deepgram | Best accuracy |
| Kannada | `kn` | Sarvam | Native support |
| Tamil | `ta` | Sarvam | Native support |
| Telugu | `te` | Sarvam | Native support |
| Malayalam | `ml` | Sarvam | Native support |
| Bengali | `bn` | Sarvam | Native support |
| Gujarati | `gu` | Sarvam | Native support |

### Detection Algorithm

1. Analyze the Unicode character ranges present in the transcript text
2. Identify the primary and secondary languages
3. Route to the appropriate STT provider
4. If the first provider fails quality validation, retry with the alternate provider
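
The first three steps can be sketched as follows. This is not the service's actual implementation: the function names, the `MIXED_SHARE` cutoff for calling a text Hindi-English, and the fallback to `en` for text without letters are all assumptions. The code point ranges are the standard Unicode blocks for each script.

```python
from collections import Counter

# Unicode blocks for the scripts of the supported languages.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}
DEEPGRAM_LANGUAGES = {"hi", "en", "hi-en"}
MIXED_SHARE = 0.2  # assumed: minimum share of the secondary language for "hi-en"

def detect_language(text: str) -> str:
    """Guess the language code from the scripts used in the text."""
    counts = Counter()
    for ch in text:
        if ch.isascii() and ch.isalpha():
            counts["en"] += 1
            continue
        cp = ord(ch)
        for code, (low, high) in SCRIPT_RANGES.items():
            if low <= cp <= high:
                counts[code] += 1
                break
    if not counts:
        return "en"  # assumed fallback; routes to Deepgram
    ranked = counts.most_common(2)
    primary = ranked[0][0]
    if len(ranked) == 2:
        secondary, secondary_count = ranked[1]
        share = secondary_count / sum(counts.values())
        if {primary, secondary} == {"hi", "en"} and share >= MIXED_SHARE:
            return "hi-en"
    return primary

def select_provider(language: str) -> str:
    """Map a language code to a provider, following the table above."""
    return "deepgram" if language in DEEPGRAM_LANGUAGES else "sarvam"
```

For example, `select_provider(detect_language("ನಮಸ್ಕಾರ"))` returns `"sarvam"`, since every character falls in the Kannada block.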

## Validation Thresholds

```python
# Duration mismatch threshold
DURATION_RATIO_THRESHOLD = 0.5  # Flag if <50% of expected

# Word density threshold
MIN_WORDS_PER_10_SECONDS = 1.0  # Flag if <1 word per 10 seconds of audio

# Repetition threshold
UNIQUE_WORD_RATIO_THRESHOLD = 0.4  # Flag if <40% unique

# Minimum transcript length
MIN_TRANSCRIPT_LENGTH = 10  # characters
```
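
To show how these thresholds could be applied together, here is a minimal validator sketch. The function name, its signature, and the issue messages are assumptions (the duration message mirrors the log line shown in the examples); the constants are repeated so the block runs on its own.

```python
DURATION_RATIO_THRESHOLD = 0.5
MIN_WORDS_PER_10_SECONDS = 1.0
UNIQUE_WORD_RATIO_THRESHOLD = 0.4
MIN_TRANSCRIPT_LENGTH = 10

def validate_transcript(transcript, transcribed_duration,
                        expected_duration=None, num_speakers=None):
    """Return a list of quality issues; an empty list means the transcript passes."""
    issues = []
    text = transcript.strip()
    words = text.lower().split()

    if len(text) < MIN_TRANSCRIPT_LENGTH:
        issues.append("Empty or very short transcript")
    if words and len(set(words)) / len(words) < UNIQUE_WORD_RATIO_THRESHOLD:
        issues.append("Repetitive text")

    # Prefer the known call duration; fall back to what the provider reported.
    reference = expected_duration or transcribed_duration
    if reference and len(words) / (reference / 10) < MIN_WORDS_PER_10_SECONDS:
        issues.append("Too few words relative to duration")

    if expected_duration and transcribed_duration / expected_duration < DURATION_RATIO_THRESHOLD:
        issues.append(
            f"Duration mismatch: transcribed {transcribed_duration}s "
            f"vs actual audio {expected_duration}s"
        )
    if num_speakers is not None and num_speakers < 2:
        issues.append("Only one speaker detected")
    return issues

print(validate_transcript("hello " * 30, 35.45, expected_duration=99, num_speakers=2))
```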

## Error Handling

### Automatic Retries
- Network errors: Retry automatically
- Empty responses: Switch provider and retry
- Quality validation failures: Switch provider and retry
- Maximum: 3 attempts in total (the initial attempt plus up to 2 retries)
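
The retry rules above can be sketched as a small loop. This is an illustration under stated assumptions, not the service's code: `transcribe_with_retry`, the `transcribers` mapping of provider name to callable, and the `validate` callback are hypothetical, and treating `OSError` as the network-error case (retried on the same provider) is a simplification.

```python
MAX_ATTEMPTS = 3

def other_provider(provider: str) -> str:
    return "sarvam" if provider == "deepgram" else "deepgram"

def transcribe_with_retry(audio_url, first_provider, transcribers, validate):
    """Try up to MAX_ATTEMPTS times, switching provider after a quality failure."""
    provider = first_provider
    providers_tried, issues = [], []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        providers_tried.append(provider)
        try:
            result = transcribers[provider](audio_url)
        except OSError as exc:
            # Network-level failure: retry with the same provider.
            issues = [f"Network error from {provider}: {exc}"]
            continue
        issues = validate(result)  # an empty transcript is reported as an issue
        if not issues:
            result["quality_validation"] = {
                "is_valid": True,
                "attempts": attempt,
                "providers_tried": providers_tried,
                "issues": [],
            }
            return result
        provider = other_provider(provider)
    return {
        "quality_validation": {
            "is_valid": False,
            "attempts": MAX_ATTEMPTS,
            "providers_tried": providers_tried,
            "issues": issues,
        }
    }
```

Starting with Sarvam and failing validation every time, this loop tries `sarvam`, `deepgram`, `sarvam`, which matches the sequence in the bad-transcript example.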

### Logging
All operations are logged with clear markers:
- 🎙️ Transcription start
- ✅ Success
- ❌ Failure/Error
- ⚠️ Warning/Quality issue
- 🔄 Retry attempt
- 📏 Duration measurement
- 📡 Provider selection

## Best Practices

1. **Always provide expected duration** when available for better validation
2. **Monitor quality validation reports** to identify systematic issues
3. **Use the correct BID** so that results are written to the right `{bid}_sarvamresponse` table
4. **Review failed transcriptions** that exhaust all retries
5. **Check provider quotas** regularly

## Troubleshooting

### Issue: All retries exhausted
**Solution**: Check audio file accessibility, verify API keys, inspect logs

### Issue: Wrong language detected
**Solution**: Language detection is text-based, so it depends on the transcript from the first attempt, which uses Deepgram by default

### Issue: Duration mismatch false positives
**Solution**: Lower `DURATION_RATIO_THRESHOLD` (currently `0.5`) to make the duration check less strict

### Issue: Import errors
**Solution**: Ensure the `post call analysis` directory is on the Python path (`PYTHONPATH` or `sys.path`)
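
Because the directory name contains spaces, it cannot be imported as a package by name; its modules become importable once the directory itself is on `sys.path`. A minimal sketch, assuming the directory sits relative to the current working directory (adjust the path to your checkout); `add_to_python_path` is a hypothetical helper.

```python
import sys
from pathlib import Path

def add_to_python_path(directory: str) -> str:
    """Put the directory at the front of sys.path if it is not already there."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

# Assumed location: relative to the current working directory.
add_to_python_path("post call analysis")
```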

## Architecture

```
┌─────────────────────────────────────────────────────┐
│          UnifiedTranscriptionService                │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────┐      ┌──────────────┐             │
│  │   Language   │      │   Quality    │             │
│  │   Detector   │      │  Validator   │             │
│  └──────────────┘      └──────────────┘             │
│                                                     │
│  ┌──────────────┐      ┌──────────────┐             │
│  │  Deepgram    │      │   Sarvam     │             │
│  │ Transcriber  │      │ Transcriber  │             │
│  └──────────────┘      └──────────────┘             │
│                                                     │
│           ↓                    ↓                    │
│  ┌─────────────────────────────────────┐            │
│  │    Automatic Retry Logic            │            │
│  │    (Max 3 attempts)                 │            │
│  └─────────────────────────────────────┘            │
│                                                     │
│           ↓                                         │
│  ┌─────────────────────────────────────┐            │
│  │    Claude Translation/Cleanup       │            │
│  └─────────────────────────────────────┘            │
│                                                     │
│           ↓                                         │
│  ┌─────────────────────────────────────┐            │
│  │    Database Storage                 │            │
│  └─────────────────────────────────────┘            │
└─────────────────────────────────────────────────────┘
```

## Examples

### Example 1: Hindi-English Call (Deepgram)
```bash
python3 process_calls_unified.py --bid 7987 --call-id 12345
```

**Output:**
```
✅ Successfully processed call 12345
Provider: deepgram
Language: hi-en
Quality attempts: 1
```

### Example 2: Kannada Call (Sarvam)
```bash
python3 process_calls_unified.py --bid 7987 --call-id 67890
```

**Output:**
```
✅ Successfully processed call 67890
Provider: sarvam
Language: kn
Quality attempts: 1
```

### Example 3: Bad Quality Retry
```bash
python3 process_calls_unified.py --bid 7987 --call-id 88615207351765699523 --reprocess
```

**Output:**
```
⚠️ Quality validation FAILED on attempt 1 with sarvam
Issues: Duration mismatch: transcribed 35.45s vs actual audio 99s
🔄 Retrying with different provider...

✅ Successfully processed call 88615207351765699523
Provider: deepgram
Language: hi-en
Quality attempts: 2
```
