In our previous post, we introduced the high-level architecture of building a Voice AI bot using baresip. Today, we’re going deep into the “nervous system” of the bot: the OpenAI Realtime API implementation.
At Sipfront, we don’t just use these tools; we tear them apart to understand exactly how they behave under stress. This know-how allows us to build the most robust test automation for our customers’ voice bots.
The “Burn-In” Flow: Ephemeral Keys and Session Updates
One of the most critical aspects of a secure and performant voice bot is how it initializes. You cannot simply hardcode an API key into a distributed client. Instead, we use a two-step “burn-in” process.
1. The Ephemeral Key (The Handshake)
Before the bot even thinks about SIP, our backend API requests a short-lived client secret from OpenAI. This key is valid for only one session and expires quickly.
Request:
{
  "model": "gpt-4o-realtime-preview",
  "expires_after": {
    "anchor": "created_at",
    "seconds": 3600
  },
  "session": {
    "type": "realtime",
    "instructions": "You are a helpful assistant.",
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}
Response:
{
  "client_secret": {
    "value": "ek_abc123...",
    "expires_at": 1712345678
  }
}
We then use this client_secret.value as a Bearer token to authenticate the WebSocket connection to the OpenAI Realtime API. The established socket becomes the primary conduit between our SIP client and the AI model, carrying both control events and raw audio data.
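Once the socket is up, the first event the server pushes is session.created, confirming the defaults inherited from the ephemeral key request. Heavily abbreviated (IDs and most fields are placeholders), it looks roughly like this:

{
  "type": "session.created",
  "event_id": "event_abc123",
  "session": {
    "id": "sess_abc123",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy",
    "turn_detection": { "type": "server_vad" }
  }
}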
2. The Session Update (The Capabilities)
Once the WebSocket is established in our SIP client, we perform a session.update. This is where we “burn in” the capabilities the bot needs for this particular call.
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant. Keep your answers short and concise.",
    "tool_choice": "none"
  }
}
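The server acknowledges with a session.updated event that echoes the now-active configuration; as in the sequence diagram below, this is the cue that the session is ready for audio. Heavily abbreviated:

{
  "type": "session.updated",
  "event_id": "event_def456",
  "session": {
    "instructions": "You are a helpful assistant. Keep your answers short and concise.",
    "tool_choice": "none"
  }
}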
The Anatomy of a Voice Bot Conversation
Here is the complete flow of a session, from token creation to the AI’s first word:
sequenceDiagram
participant Caller as SIP Caller
participant Baresip as Baresip (openai_rt)
participant API as Sipfront API
participant OpenAI as OpenAI Realtime
rect rgba(0, 0, 0, 0)
Note over Baresip, OpenAI: 1. Ephemeral Token Creation (The Handshake)
Baresip->>API: Request Session Token
API->>OpenAI: POST /v1/realtime/client_secrets
OpenAI-->>API: { "client_secret": { "value": "ek_..." } }
API-->>Baresip: Ephemeral Token
end
Note over Baresip, OpenAI: 2. WebSocket Setup
Baresip->>OpenAI: WSS Connect (Authorization: Bearer ek_...)
OpenAI-->>Baresip: session.created
Baresip->>OpenAI: session.update (Instructions, VAD, Voice)
OpenAI-->>Baresip: session.updated
Note over Caller, OpenAI: 3. Active Call & Audio Flow
Caller->>Baresip: RTP Audio (G.711/Opus)
Baresip->>Baresip: Resample to 24kHz PCM
Baresip->>OpenAI: input_audio_buffer.append (Base64)
Note over OpenAI: Turn Detection (VAD)
OpenAI->>Baresip: input_audio_buffer.speech_started
Baresip->>Baresip: Clear Injection Buffer (Interruption)
OpenAI->>Baresip: response.output_audio.delta (Base64)
Baresip->>Baresip: Decode & Buffer PCM
Baresip->>Caller: RTP Audio (from Injection Buffer)
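For reference, the two audio-carrying events from the diagram are plain JSON wrapping base64-encoded 24 kHz PCM16; roughly like this (payloads and IDs are placeholders):

// SIP client -> OpenAI: resampled caller audio
{
  "type": "input_audio_buffer.append",
  "audio": "Jv39/v7+/f4k..."
}

// OpenAI -> SIP client: synthesized speech, decoded into the injection buffer
{
  "type": "response.output_audio.delta",
  "response_id": "resp_123",
  "item_id": "item_456",
  "output_index": 0,
  "content_index": 0,
  "delta": "9v3+/f79/v3..."
}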
Tool-Calling: Giving the Bot Hands
A voice bot that can’t do anything is just a fancy walkie-talkie. Tool-calling is what makes it an agent. In our implementation, we define tools like hangup_call, send_dtmf, and api_call directly in the SIP client. This lets our bots, for example, auto-discover a full IVR menu during an IVR test and push the result to a web API of yours. Likewise, if you prompt it accordingly, the bot can send regular real-time updates and summaries of the test call content to your infrastructure.
Initializing Tools
To use tools, they must be declared during the session initialization. This happens in the session.update event. Here is how we initialize our three core tools:
{
  "type": "session.update",
  "session": {
    "tools": [
      {
        "type": "function",
        "name": "hangup_call",
        "description": "Ends the current SIP call immediately."
      },
      {
        "type": "function",
        "name": "send_dtmf",
        "description": "Sends DTMF tones (digits) to the caller.",
        "parameters": {
          "type": "object",
          "properties": {
            "digits": {
              "type": "string",
              "description": "The sequence of digits to send (0-9, *, #)."
            }
          },
          "required": ["digits"]
        }
      },
      {
        "type": "function",
        "name": "api_call",
        "description": "Performs an external API request to fetch or update data.",
        "parameters": {
          "type": "object",
          "properties": {
            "endpoint": { "type": "string" },
            "method": { "type": "string", "enum": ["GET", "POST"] },
            "payload": { "type": "string" }
          },
          "required": ["endpoint", "method"]
        }
      }
    ],
    "tool_choice": "auto"
  }
}
When the LLM decides to use a tool, it sends a function_call item. Our module parses the arguments and executes the corresponding action:
// Example: The AI decides to hang up the call
{
  "type": "response.output_item.done",
  "item": {
    "type": "function_call",
    "name": "hangup_call",
    "call_id": "call_123",
    "arguments": "{}"
  }
}
Our implementation catches this, triggers the SIP BYE, and sends the result back to OpenAI so the “brain” knows the hand successfully moved.
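Closing that loop is two more events on the same WebSocket; a minimal sketch with a placeholder call_id and a made-up output payload:

// 1. Report the tool result back into the conversation
{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_123",
    "output": "{\"status\": \"ok\"}"
  }
}

// 2. Ask the model to continue based on that result
{
  "type": "response.create"
}

For hangup_call the second step is moot (the call is already gone), but for send_dtmf and api_call it is what lets the model narrate or act on the result.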
Interruption (Barge-in) and VAD Tuning
The difference between a “bot” and a “person” is how they handle interruptions. If you speak while the bot is talking, it must stop immediately.
We achieve this by listening for the input_audio_buffer.speech_started event. The moment this arrives, we flush our circular injection buffer, effectively stopping the bot’s speech in its tracks:
sequenceDiagram
participant OpenAI as OpenAI Realtime
participant Baresip as Baresip (openai_rt)
participant Buffer as Injection Buffer
participant Caller as SIP Caller
Note over Baresip, Buffer: Bot is currently speaking
Buffer->>Baresip: Read PCM Chunks
Baresip->>Caller: RTP Audio
OpenAI->>Baresip: input_audio_buffer.speech_started
Note right of Baresip: INTERRUPT DETECTED
Baresip->>Buffer: Flush / Clear Buffer
Note over Buffer: Buffer Empty
Baresip->>Caller: Silence / Comfort Noise
Note over Caller: Bot stops speaking instantly
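The trigger itself is a tiny event; roughly like this (timestamp and IDs are placeholders):

{
  "type": "input_audio_buffer.speech_started",
  "event_id": "event_ghi789",
  "audio_start_ms": 1200,
  "item_id": "item_789"
}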
Tuning for Speed: Server VAD vs. Semantic VAD
To make the bot feel “snappy,” we tune the Voice Activity Detection (VAD) parameters. There are two main approaches to VAD in the OpenAI Realtime API:
- Server VAD (Traditional): This is the default mode, where the server uses a dedicated audio processing model to detect when speech starts and ends. It’s extremely fast and reliable for simple turn-taking.
  - threshold: Sensitivity of voice detection (default: 0.5).
  - silence_duration_ms: How long to wait after you stop speaking (we often tune this to 300-500ms for fast-paced talkers).
  - prefix_padding_ms: How much audio before the speech detection to include (crucial for catching the first syllable).
- Semantic VAD (Advanced): In this mode, the LLM itself helps decide whether the user has finished their thought. This is much better at handling natural speech patterns, like a user pausing to think mid-sentence, but it can introduce slightly more latency, as the “brain” needs to process the context.
  - eagerness: Controls how quickly the model responds. A higher eagerness (e.g. high) makes the bot jump in as soon as it thinks you might be done, while a lower value (e.g. low) makes it more patient, waiting to be sure you’ve finished your thought. See the example update below.
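Switching between the two modes is just another session.update on the turn_detection object. A minimal sketch, assuming the documented semantic_vad type and an eagerness of high (the server_vad variant with threshold, prefix_padding_ms, and silence_duration_ms is the one shown in the ephemeral key request above):

{
  "type": "session.update",
  "session": {
    "turn_detection": {
      "type": "semantic_vad",
      "eagerness": "high"
    }
  }
}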
At Sipfront, we typically recommend Server VAD for high-performance voice agents where low latency is the top priority, but we use Semantic VAD in our test suites to simulate more complex human interactions and verify how well a bot handles mid-sentence pauses. For detailed documentation on the different settings, see the OpenAI Realtime VAD Guide.
Why Sipfront?
Building this implementation from the ground up, down to the last line of C code and every JSON event, gives us a unique advantage. We don’t just build voice bots; we build the systems that test them.
Understanding the “nervous system” of a bot allows us to know exactly what to measure and where to look when a system under test isn’t behaving. When we see a bot struggling with high latency, we know to check the VAD thresholds or the WebSocket pacing. When a bot fails to interrupt, we know to look at the buffer management or the speech_started event handling.
This deep, code-level knowledge is what allows Sipfront to provide the most authoritative test automation in the industry. We don’t just tell you that your bot is slow; we help you understand why it’s slow and how to fix it.
If you are building the next generation of Voice AI, you need a testing partner that knows the code as well as you do. Contact us to see how we can help you benchmark and secure your AI voice agents.