Voice AI Conversations with LiveKit and Streaming AI Pipelines
A low-latency voice AI system built on LiveKit, delivering sub-second conversational turnarounds using streaming speech pipelines and step-wise retrieval-augmented generation.
What is this Project
This is the real-time voice AI agent our customers use at Rhythmiq; we also use it for customer demos. The frontend is served from a customized fork of the LiveKit agents React repo, which you can find here.
I had to make a few tweaks so it works directly with our existing system, since the agent relies on call metadata to fetch the right assistant configuration in real time based on the phone number receiving the call.
I am the sole developer responsible for designing, implementing, deploying to production, and scaling this entire system.
On the backend, we use LiveKit for real-time communication. The LiveKit worker runs a three-stage pipeline:
- Speech-to-Text (STT) - Converts spoken words into text
- Large Language Model (LLM) - Generates intelligent responses
- Text-to-Speech (TTS) - Converts responses back to natural speech
The codebase for this LiveKit worker lives in a private repository under the company's GitHub account. If you have any questions, please reach out; I'd love to give you a walkthrough!
The main use cases are inbound and outbound AI calling, with seamless integration over SIP with multiple telephony providers (Twilio, Plivo, Teler, etc.). The system serves a wide range of scenarios, such as screening interested candidates before transferring them to a human agent, or reaching out to existing customers to collect feedback. Recordings and transcripts are available on Rhythmiq's dashboard right after each call.
The Challenge
We ran into many challenges when we started implementing a voice AI solution. I had prior experience with VoIP systems from my time at FreJun, but not much with voice AI itself. Here's what we had to deliver:
- Ultra-low latency: Users expect natural conversation flow, requiring end-to-end response times under 1 second
- Multi-tenancy: Different customers need different assistants, which require different configurations, knowledge bases, and voice characteristics
- Scalability: The system must handle concurrent calls without degradation
- Cost efficiency: Real-time AI inference is expensive, and India is a price-sensitive market
Our first instinct was to build a thin wrapper around ElevenLabs, Vapi, or a similar provider and keep the customer-specific logic in our application. As you might imagine, the costs were too high to support any viable business case. Even when we asked ElevenLabs for a custom high-volume quote, the price was still more than 3x what this pipeline costs us, which makes sense given that they handle all of the technical scaling and load for you.
So we moved on to building our own agents. LiveKit lets us swap out any step of the pipeline (STT, LLM, and TTS), so we can offer price-quality trade-offs tailored to each individual customer.
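To make that concrete, here is a minimal sketch of a LiveKit Agents worker entrypoint with the pipeline assembled from interchangeable plugins. The plugin choices, model names, and instructions below are illustrative, not our production configuration (which is loaded per assistant at runtime, as described later):

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    # Each stage is a plugin, so swapping STT/LLM/TTS (or pointing a stage at a
    # self-hosted model) is a configuration change rather than a rewrite.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2", language="en"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(voice="alloy"),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))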
System Architecture
The system follows a pipeline architecture where audio flows through discrete stages: Speech-to-Text (STT) → Intelligent RAG Decision (whether to use RAG or not) → Retrieval (if needed) → Language Model (LLM) → Text-to-Speech (TTS). Each stage is independently configurable and can be swapped based on requirements.
graph LR
subgraph "Phone Network"
Phone[Phone Caller]
Telco[Telco Network]
Frejun[SIP Provider]
end
subgraph "LiveKit Infrastructure"
LiveKitSIP[LiveKit SIP Gateway]
LiveKitRoom[LiveKit Room]
end
subgraph "Voice AI Agent"
STT[Speech-to-Text]
VAD[Voice Activity Detection]
RAGDecision[RAG Decision Engine]
RAGRetrieval[Vector Search]
LLM[Language Model]
TTS[Text-to-Speech]
end
subgraph "Data Layer"
PostgreSQL[(PostgreSQL<br/>Configuration & Embeddings)]
Redis[(Redis<br/>Vector Search & Cache)]
end
Phone -->|PSTN| Telco
Telco -->|SIP| Frejun
Frejun -->|SIP Trunk| LiveKitSIP
LiveKitSIP <-->|WebRTC| LiveKitRoom
LiveKitRoom -->|Audio Stream| STT
STT -->|Text| RAGDecision
RAGDecision -->|Query| RAGRetrieval
RAGRetrieval -->|Context| LLM
LLM -->|Response Text| TTS
TTS -->|Audio Stream| LiveKitRoom
LiveKitRoom -->|WebRTC| LiveKitSIP
LiveKitSIP -->|SIP| Frejun
RAGDecision -.->|Config| PostgreSQL
RAGRetrieval -.->|Vector Search| Redis
RAGRetrieval -.->|Load Embeddings| PostgreSQL
Core Components
LiveKit Integration: LiveKit serves as our real-time communication backbone, handling WebRTC connections, SIP bridging (for phone calls), and audio streaming. The agent connects to LiveKit rooms and processes audio frames in real-time.
Dynamic Configuration System: Instead of hardcoding configurations, the system loads assistant-specific settings from PostgreSQL at runtime. This enables true multi-tenancy where each phone number can map to a different assistant with unique STT, LLM, TTS, and RAG settings.
Redis Vector Search: We use Redis with RediSearch for vector similarity search. Embeddings are stored as binary vectors (1536 dimensions from OpenAI’s text-embedding-3-small) and retrieved using KNN (K-Nearest Neighbors) queries.
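For illustration, storing a document chunk looks roughly like this with redis-py; the index name, key prefix, and field names are hypothetical, but the float32 binary packing and the cosine vector index match the setup described above:

import numpy as np
import redis
from redis.commands.search.field import TagField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

r = redis.Redis(host="localhost", port=6379)

# One-time index creation: a 1536-dim FLAT vector index with cosine distance,
# plus a tag field so searches can be scoped to a single assistant.
r.ft("kb_idx").create_index(
    fields=[
        TagField("assistant_id"),
        TextField("content"),
        VectorField(
            "embedding",
            "FLAT",
            {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"},
        ),
    ],
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.HASH),
)

def store_chunk(doc_id: str, assistant_id: str, content: str, embedding: list[float]) -> None:
    # RediSearch expects the vector as raw float32 bytes.
    r.hset(
        f"doc:{doc_id}",
        mapping={
            "assistant_id": assistant_id,
            "content": content,
            "embedding": np.asarray(embedding, dtype=np.float32).tobytes(),
        },
    )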
PostgreSQL Database: Stores assistant configurations, phone number mappings, and document embeddings. We use asyncpg for async connection pooling to handle concurrent requests efficiently.
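A simplified sketch of the runtime lookup, with hypothetical table and column names (the real schema carries the full STT/LLM/TTS/RAG settings per assistant):

import asyncpg

CONFIG_QUERY = """
    SELECT a.id, a.stt_provider, a.llm_model, a.tts_voice, a.language_code, a.rag_enabled
    FROM assistants a
    JOIN phone_numbers p ON p.assistant_id = a.id
    WHERE p.number = $1
"""

class ConfigStore:
    def __init__(self) -> None:
        self._pool = None  # asyncpg pool, created on connect()

    async def connect(self, dsn: str) -> None:
        # One shared pool so concurrent calls don't each open their own connection.
        self._pool = await asyncpg.create_pool(dsn, min_size=2, max_size=10)

    async def config_for_number(self, phone_number: str) -> dict | None:
        # Called when a call lands: map the dialed number to its assistant config.
        row = await self._pool.fetchrow(CONFIG_QUERY, phone_number)
        return dict(row) if row else None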
Optimizations
Although we use a mix of providers and locally running models for each step, here are some optimizations that stand out and that greatly improved latency:
Preemptive Generation
One of our key optimizations is preemptive generation: the LLM begins generating a response as soon as the user starts speaking, before they finish. This is enabled by LiveKit’s VAD (Voice Activity Detection) and turn detection, which predicts when the user is likely done speaking.
This can reduce perceived latency by 500ms-1s, making conversations feel more natural.
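LiveKit's VAD and turn detector drive this for us, but conceptually the control flow looks like the sketch below: kick off a speculative LLM task when the end of the turn is predicted, cancel it if the caller keeps talking, and reuse it once the turn is actually committed. The class and callback names here are illustrative, not the LiveKit API:

import asyncio

class PreemptiveResponder:
    def __init__(self, generate_reply):
        # generate_reply: async callable mapping a transcript to reply text.
        self._generate_reply = generate_reply
        self._pending: asyncio.Task | None = None

    def on_likely_end_of_turn(self, interim_transcript: str) -> None:
        # Fired by VAD/turn detection before the final transcript is committed:
        # start generating speculatively so the reply is ready (or nearly ready).
        self._pending = asyncio.create_task(self._generate_reply(interim_transcript))

    def on_speech_resumed(self) -> None:
        # The prediction was wrong; discard the speculative generation.
        if self._pending and not self._pending.done():
            self._pending.cancel()
        self._pending = None

    async def on_turn_committed(self, final_transcript: str) -> str:
        # The user really finished; usually the speculative reply is already done.
        if self._pending is not None:
            return await self._pending
        return await self._generate_reply(final_transcript)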
Faster Whisper (Custom GPU Implementation): For scenarios requiring on-premise deployment or specific model customization, we've built a custom Whisper plugin using faster-whisper (a minimal sketch follows the list below). This implementation includes:
- GPU acceleration with CUDA support
- Model warmup to reduce first-inference latency
- VAD (Voice Activity Detection) filtering to reduce false positives
- Custom initial prompts for domain-specific optimization
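A condensed sketch of those pieces using the faster-whisper library; the model size, VAD thresholds, and domain prompt are illustrative, and the production plugin wraps this in LiveKit's STT interface:

import numpy as np
from faster_whisper import WhisperModel

# GPU-accelerated backend.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Warm-up: transcription is lazy, so consume the generator once on a silent
# buffer to pay the initialization cost before the first real utterance.
_segments, _info = model.transcribe(np.zeros(16000, dtype=np.float32), language="en")
list(_segments)

def transcribe_chunk(audio: np.ndarray, domain_prompt: str) -> str:
    # audio: 16 kHz mono float32 samples from the combined LiveKit frames.
    segments, _ = model.transcribe(
        audio,
        language="en",
        vad_filter=True,                                 # drop non-speech before inference
        vad_parameters={"min_silence_duration_ms": 300},
        initial_prompt=domain_prompt,                    # bias decoding toward domain terms
        beam_size=1,                                     # greedy decoding keeps latency low
    )
    return " ".join(seg.text.strip() for seg in segments)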
Streaming vs. Batch: We use streaming STT where available (Deepgram) to minimize latency. For Whisper, we batch process but optimize with VAD to reduce unnecessary inference.
Language Detection: Language is specified per-assistant in configuration, avoiding runtime detection overhead.
Audio Preprocessing: Audio frames are combined and converted to the format expected by each provider, with format-specific optimizations.
Kokoro TTS (Custom GPU Implementation): For certain use cases, we also run a custom plugin for Kokoro 82M, which is a relatively lightweight model that gives good results for English. Although it’s not a state-of-the-art model, it saves us significantly on costs.
Intelligent RAG System
Not every user query requires knowledge base retrieval. Simple greetings, confirmations, or follow-up questions can be answered directly by the LLM. However, questions about specific products, pricing, or features require accurate, up-to-date information from the knowledge base.
The challenge: How do we decide when to use RAG without adding significant latency?
Hybrid Decision-Making Architecture
We’ve implemented a two-tier decision system that combines fast heuristics with LLM-based classification for uncertain cases.
flowchart LR
Start[User Message] --> FastPath{Heuristic Check}
FastPath -->|Obvious NO_RAG<br/>Greeting/Confirmation| SkipRAG[Skip RAG<br/>Direct to LLM]
FastPath -->|Obvious NEEDS_RAG<br/>Question with keywords| DoRAG[Execute RAG]
FastPath -->|Uncertain| LLMCheck[LLM Decision<br/>100-200ms]
LLMCheck -->|RAG Needed| DoRAG
LLMCheck -->|No RAG| SkipRAG
DoRAG --> Embed[Generate Embeddings<br/>50-100ms]
Embed --> Search[Redis KNN Search<br/>10-50ms]
Search --> Context[Retrieve Context]
Context --> LLM[LLM with Context]
SkipRAG --> LLM
LLM --> TTS[Text-to-Speech]
Fast-Path Heuristics
For obvious cases, we use regex-based pattern matching that executes in microseconds:
import re

def is_obvious_no_rag(text: str, language_code: str) -> bool:
    """Fast check for greetings, confirmations, small talk."""
    text_lower = text.lower().strip()
    # Very short messages
    if len(text_lower) < 10:
        if text_lower in ['?', '!', '.', ',', '...']:
            return True
        short_patterns = [
            r'^(hi|hey|hello|hii)$',
            r'^(yes|no|ok|okay|sure)$',
            r'^(thanks|thank you|thx)$',
        ]
        for pattern in short_patterns:
            if re.match(pattern, text_lower):
                return True
    # Small talk patterns
    small_talk_patterns = [
        r'^(how are you|how\'?s it going)$',
        r'^(got it|understood|i see)$',
    ]
    for pattern in small_talk_patterns:
        if re.match(pattern, text_lower):
            return True
    return False

def is_obvious_needs_rag(text: str) -> bool:
    """Fast check for questions requiring knowledge base."""
    text_lower = text.lower().strip()
    # Questions with specific indicators
    if len(text_lower) > 20 and '?' in text:
        specific_indicators = [
            r'(what is|what are|what does)',
            r'(how to|how do|how can)',
            r'(tell me about|explain|describe)',
            r'(information about|more about|details about)',
        ]
        for pattern in specific_indicators:
            if re.search(pattern, text_lower):
                return True
    return False
LLM-Based Classification
For uncertain cases, we use a lightweight LLM (Llama 3.1 8B) to make the decision. This adds 100-200ms but ensures accuracy.
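A minimal sketch of that call, assuming an OpenAI-compatible endpoint serving the 8B model; the base URL, model name, and prompt wording are illustrative:

from openai import AsyncOpenAI

# Works against any OpenAI-compatible server (vLLM, a hosted provider, etc.).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

DECISION_PROMPT = (
    "You route messages in a voice call. Reply with exactly NEEDS_RAG or NO_RAG.\n"
    "NEEDS_RAG: the user asks about products, pricing, policies, or other facts "
    "that must come from the knowledge base.\n"
    "NO_RAG: greetings, confirmations, chit-chat, or follow-ups answerable from context."
)

async def llm_needs_rag(user_message: str) -> bool:
    resp = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",
        messages=[
            {"role": "system", "content": DECISION_PROMPT},
            {"role": "user", "content": user_message},
        ],
        max_tokens=5,
        temperature=0.0,
    )
    return "NEEDS_RAG" in (resp.choices[0].message.content or "").upper()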
Key Optimizations:
- KNN=1: We retrieve only the most relevant document to minimize latency
- Cosine Similarity: Redis uses cosine distance for semantic similarity
- Assistant Filtering: Each search is scoped to the specific assistant’s knowledge base (see the query sketch below)
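Putting those three together, the retrieval query looks roughly like this with redis-py (the index and field names follow the hypothetical storage sketch earlier):

import numpy as np
from redis.commands.search.query import Query

def top_match(r, assistant_id: str, query_embedding: list[float]):
    # KNN 1 inside an assistant-scoped tag filter; "score" is the cosine distance.
    q = (
        Query(f"(@assistant_id:{{{assistant_id}}})=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("content", "score")
        .dialect(2)
    )
    params = {"vec": np.asarray(query_embedding, dtype=np.float32).tobytes()}
    res = r.ft("kb_idx").search(q, query_params=params)
    return res.docs[0] if res.docs else None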
User Feedback: Filler Words
To provide immediate feedback when RAG is executing, we speak a filler word/phrase before starting retrieval:
if needs_rag_result:
    # Get language-appropriate filler word
    filler = self._get_filler_word(self.language_code, is_complex_question)
    # Speak immediately (non-blocking)
    await self.session.say(filler, allow_interruptions=False)
    # Execute RAG in parallel
    q_context = await self._fetch_rag_context(rag_content)
Filler words are localized:
- English: “One moment”, “Let me check”, “Hmm, let me see”
- Hindi: “एक मिनट दीजिए” (“One minute, please”), “ठीक है, मैं देखता हूं” (“Okay, let me see”), etc.
This creates a more natural conversation flow where users know the system is processing their request.
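The _get_filler_word helper boils down to a lookup along these lines; the phrase table below is trimmed to the examples above, while the production agent keeps a larger, per-assistant set:

import random

FILLERS = {
    "en": {"short": ["One moment", "Let me check"], "long": ["Hmm, let me see"]},
    "hi": {"short": ["एक मिनट दीजिए"], "long": ["ठीक है, मैं देखता हूं"]},
}

def get_filler_word(language_code: str, is_complex_question: bool) -> str:
    # Complex questions get a slightly longer filler to cover longer retrieval.
    table = FILLERS.get(language_code, FILLERS["en"])
    bucket = "long" if is_complex_question else "short"
    return random.choice(table[bucket])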
Future Growth Opportunities
We’re building new features into our agents every day, from enhanced tool calling to new models at each step of the pipeline. Once prices come down enough, multimodal models are sure to join the mix. If you have any ideas or feedback after using this, or would like to contribute in any way, please let me know.
Also, if you’re facing any issues using the demo link above, please reach out. I’ll make sure to get back to you!
— Ray
