In our previous post, we introduced the high-level architecture of building a Voice AI bot using baresip. Today, we’re going deep into the “nervous system” of the bot: the OpenAI Realtime API implementation.

At Sipfront, we don’t just use these tools; we tear them apart to understand exactly how they behave under stress. This know-how allows us to build the most robust test automation for our customers’ voice bots.

The “Burn-In” Flow: Ephemeral Keys and Session Updates

One of the most critical aspects of a secure and performant voice bot is how it initializes. You cannot simply hardcode an API key into a distributed client. Instead, we use a two-step “burn-in” process.

1. The Ephemeral Key (The Handshake)

Before the bot even thinks about SIP, our backend API requests a short-lived client secret from OpenAI. This key is valid for only one session and expires quickly.

Request:

{
    "model": "gpt-4o-realtime-preview",
    "expires_after": {
        "anchor": "created_at",
        "seconds": 3600
    },
    "session": {
        "type": "realtime",
        "instructions": "You are a helpful assistant.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500
        }
    }
}

Response:

{
    "client_secret": {
        "value": "ek_abc123...",
        "expires_at": 1712345678
    }
}

We then use this client_secret.value as a Bearer token to authenticate the WebSocket connection to the OpenAI Realtime API. Once established, this socket becomes the primary conduit between our SIP client and the AI model, carrying both control events and raw audio data.
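
For reference, the WebSocket handshake itself is a plain WSS upgrade against the Realtime endpoint, authenticated with the ephemeral key. The URL and model parameter below are illustrative and depend on the API version you target:

Connect:  wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
Header:   Authorization: Bearer ek_abc123...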

2. The Session Update (The Capabilities)

Once the WebSocket is established in our SIP client, we perform a session.update. This is where we “burn in” the specific capabilities the bot needs for this particular call.

{
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful assistant. Keep your answers short and concise.",
        "tool_choice": "none"
    }
}

The Anatomy of a Voice Bot Conversation

Here is the complete flow of a session, from token creation to the AI’s first word:

  sequenceDiagram
    participant Caller as SIP Caller
    participant Baresip as Baresip (openai_rt)
    participant API as Sipfront API
    participant OpenAI as OpenAI Realtime

    rect rgba(0, 0, 0, 0)
    Note over Baresip, OpenAI: 1. Ephemeral Token Creation (The Handshake)
    Baresip->>API: Request Session Token
    API->>OpenAI: POST /v1/realtime/client_secrets
    OpenAI-->>API: { "client_secret": { "value": "ek_..." } }
    API-->>Baresip: Ephemeral Token
    end

    Note over Baresip, OpenAI: 2. WebSocket Setup
    Baresip->>OpenAI: WSS Connect (Authorization: Bearer ek_...)
    OpenAI-->>Baresip: session.created
    Baresip->>OpenAI: session.update (Instructions, VAD, Voice)
    OpenAI-->>Baresip: session.updated

    Note over Caller, OpenAI: 3. Active Call & Audio Flow
    Caller->>Baresip: RTP Audio (G.711/Opus)
    Baresip->>Baresip: Resample to 24kHz PCM
    Baresip->>OpenAI: input_audio_buffer.append (Base64)
    
    Note over OpenAI: Turn Detection (VAD)
    OpenAI->>Baresip: input_audio_buffer.speech_started
    Baresip->>Baresip: Clear Injection Buffer (Interruption)
    
    OpenAI->>Baresip: response.output_audio.delta (Base64)
    Baresip->>Baresip: Decode & Buffer PCM
    Baresip->>Caller: RTP Audio (from Injection Buffer)
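
To make step 3 concrete: every resampled 24 kHz PCM chunk is base64-encoded and streamed to OpenAI as an input_audio_buffer.append event. The payload below is a sketch with a shortened placeholder instead of real audio data:

// Sent continuously while the caller is speaking
{
    "type": "input_audio_buffer.append",
    "audio": "<base64-encoded 24kHz PCM16 chunk>"
}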

Tool-Calling: Giving the Bot Hands

A voice bot that can’t do anything is just a fancy walkie-talkie. Tool-calling is what makes it an agent. In our implementation, we define tools like hangup_call, send_dtmf, and api_call directly in the SIP client. This gives our bots the ability, for example, to auto-discover a full IVR menu and send the result to a web API of yours during IVR tests. Likewise, the bot can push regular real-time updates and summaries about the test call to your infrastructure, if you prompt it to do so.

Initializing Tools

To use tools, they must be declared during the session initialization. This happens in the session.update event. Here is how we initialize our three core tools:

{
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "function",
                "name": "hangup_call",
                "description": "Ends the current SIP call immediately."
            },
            {
                "type": "function",
                "name": "send_dtmf",
                "description": "Sends DTMF tones (digits) to the caller.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "digits": {
                            "type": "string",
                            "description": "The sequence of digits to send (0-9, *, #)."
                        }
                    },
                    "required": ["digits"]
                }
            },
            {
                "type": "function",
                "name": "api_call",
                "description": "Performs an external API request to fetch or update data.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "endpoint": { "type": "string" },
                        "method": { "type": "string", "enum": ["GET", "POST"] },
                        "payload": { "type": "string" }
                    },
                    "required": ["endpoint", "method"]
                }
            }
        ],
        "tool_choice": "auto"
    }
}

When the LLM decides to use a tool, it sends a function_call item. Our module parses the arguments and executes the corresponding action:

// Example: The AI decides to hang up the call
{
    "type": "response.output_item.done",
    "item": {
        "type": "function_call",
        "name": "hangup_call",
        "call_id": "call_123",
        "arguments": "{}"
    }
}

Our implementation catches this, triggers the SIP BYE, and sends the result back to OpenAI so the “brain” knows the hand successfully moved.
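
Closing that loop looks roughly like the following two events: we report the tool result as a function_call_output item referencing the original call_id and, for tools where the call is still up, trigger the next response. Treat the payloads as a sketch of this pattern, not a verbatim capture from our module:

// 1. Report the result of the executed tool
{
    "type": "conversation.item.create",
    "item": {
        "type": "function_call_output",
        "call_id": "call_123",
        "output": "{\"status\": \"call terminated\"}"
    }
}

// 2. For tools that don't end the call: let the model respond with the result
{
    "type": "response.create"
}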

Interruption (Barge-in) and VAD Tuning

The difference between a “bot” and a “person” is how they handle interruptions. If you speak while the bot is talking, it must stop immediately.

We achieve this by listening for the input_audio_buffer.speech_started event. The moment this arrives, we flush our circular injection buffer, effectively stopping the bot’s speech in its tracks:

  sequenceDiagram
    participant OpenAI as OpenAI Realtime
    participant Baresip as Baresip (openai_rt)
    participant Buffer as Injection Buffer
    participant Caller as SIP Caller

    Note over Baresip, Buffer: Bot is currently speaking
    Buffer->>Baresip: Read PCM Chunks
    Baresip->>Caller: RTP Audio

    OpenAI->>Baresip: input_audio_buffer.speech_started
    Note right of Baresip: INTERRUPT DETECTED
    
    Baresip->>Buffer: Flush / Clear Buffer
    Note over Buffer: Buffer Empty
    
    Baresip->>Caller: Silence / Comfort Noise
    Note over Caller: Bot stops speaking instantly
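
The trigger for all of this is a single, small event. A typical payload looks roughly like the following; the exact field values are illustrative:

{
    "type": "input_audio_buffer.speech_started",
    "audio_start_ms": 1840,
    "item_id": "item_abc123"
}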

Tuning for Speed: Server VAD vs. Semantic VAD

To make the bot feel “snappy,” we tune the Voice Activity Detection (VAD) parameters. There are two main approaches to VAD in the OpenAI Realtime API:

  1. Server VAD (Traditional): This is the default mode where the server uses a dedicated audio processing model to detect when speech starts and ends. It’s extremely fast and reliable for simple turn-taking.

    • threshold: Sensitivity of voice detection (default: 0.5).
    • silence_duration_ms: How long to wait after you stop speaking before the turn is considered over (we often tune this down to 300-500ms for fast-paced conversations).
    • prefix_padding_ms: How much audio recorded before speech detection to include (crucial for catching the first syllable).
  2. Semantic VAD (Advanced): In this mode, the LLM itself helps decide if the user has finished their thought. This is much better at handling natural speech patterns, like when a user pauses to think mid-sentence, but it can introduce slightly more latency as the “brain” needs to process the context.

    • eagerness: This parameter controls how quickly the model responds. A higher eagerness (e.g., high) makes the bot jump in as soon as it thinks you might be done, while a lower value (e.g., low) makes it more patient, waiting to be sure you’ve finished your thought (see the configuration sketch below).
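
Switching between the two modes is just another session.update. The following is a minimal sketch of a Semantic VAD configuration; the semantic_vad type and the eagerness values (low, medium, high, auto) follow OpenAI’s documented settings, but treat the exact payload as illustrative rather than a copy of our production config:

{
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "semantic_vad",
            "eagerness": "high"
        }
    }
}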

At Sipfront, we typically recommend Server VAD for high-performance voice agents where low latency is the top priority, but we use Semantic VAD in our test suites to simulate more complex human interactions and verify how well a bot handles mid-sentence pauses. For detailed documentation on the different settings, check the OpenAI Realtime VAD Guide.

Why Sipfront?

Building this implementation from the ground up, down to the last line of C code and every JSON event, gives us a unique advantage. We don’t just build voice bots; we build the systems that test them.

Understanding the “nervous system” of a bot allows us to know exactly what to measure and where to look when a system under test isn’t behaving. When we see a bot struggling with high latency, we know to check the VAD thresholds or the WebSocket pacing. When a bot fails to interrupt, we know to look at the buffer management or the speech_started event handling.

This deep, code-level knowledge is what allows Sipfront to provide the most authoritative test automation in the industry. We don’t just tell you that your bot is slow; we help you understand why it’s slow and how to fix it.

If you are building the next generation of Voice AI, you need a testing partner that knows the code as well as you do. Contact us to see how we can help you benchmark and secure your AI voice agents.
