Following our exploration of OpenAI Realtime, we now turn our attention to Google Gemini Live. While both platforms aim to provide low-latency, conversational AI, the underlying “nervous systems” have distinct characteristics.
At Sipfront, our deep dive into the code allows us to compare these giants not just on their features, but on their raw technical behavior.
The Initialization: Ephemeral Auth Tokens
Similar to OpenAI, Gemini Live uses a short-lived token for secure client-side authentication. However, the “burn-in” process is slightly different.
1. The Auth Token Request
Our backend API requests an ephemeral token from Google. Unlike OpenAI, where we define the session parameters during token creation, Gemini’s token request is more focused on the lifecycle of the token itself.
Request:
```json
{
  "expireTime": "2026-01-29T11:00:00.000000+00:00",
  "newSessionExpireTime": "2026-01-29T10:01:00.000000+00:00",
  "uses": 1
}
```
Response:
```json
{
  "name": "auth_tokens/abc123...",
  "expireTime": "2026-01-29T11:00:00Z"
}
```
We use this `name` value as the token to authenticate the WebSocket. A key difference here is the header: while OpenAI uses `Authorization: Bearer`, Gemini uses `Authorization: Token` (for ephemeral tokens) or `x-goog-api-key` (for regular API keys).
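To make this concrete, here is a minimal Python sketch of both steps. Our production code does this in C inside baresip; the REST path and WebSocket URL below are assumptions reconstructed from the flow in this post, so verify them against Google's current documentation.

```python
# Sketch: minting an ephemeral token (backend) and opening the Live
# WebSocket (client). Paths/URLs are assumptions -- check them against
# Google's current docs. Our production code does this in C (baresip).
import requests
from websocket import create_connection  # pip install websocket-client

API_KEY = "YOUR_GOOGLE_API_KEY"  # long-lived key, backend only

# 1. Backend: exchange the API key for a single-use ephemeral token.
resp = requests.post(
    "https://generativelanguage.googleapis.com/v1alpha/auth_tokens",
    headers={"x-goog-api-key": API_KEY},
    json={
        "expireTime": "2026-01-29T11:00:00.000000+00:00",
        "newSessionExpireTime": "2026-01-29T10:01:00.000000+00:00",
        "uses": 1,
    },
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["name"]  # "auth_tokens/abc123..."

# 2. Client: open the WebSocket with the ephemeral token header.
WS_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
)
ws = create_connection(WS_URL, header={"Authorization": f"Token {token}"})
```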
2. The Setup Message (The Capabilities)
Once the WebSocket is connected, we send a setup message. This is where Gemini's session truly comes alive: we define the model, the voice, and the system instructions.
```json
{
  "setup": {
    "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
    "generationConfig": {
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": { "voiceName": "Aoede" }
        }
      },
      "temperature": 0.7
    },
    "systemInstruction": {
      "parts": [{ "text": "You are a helpful assistant." }]
    }
  }
}
```
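Continuing the Python sketch from above (`ws` is the connected WebSocket), the handshake itself is a single send followed by reading until the acknowledgement arrives:

```python
# Sketch, continuing from the connection above: send the setup message
# and block until the server acknowledges with setupComplete.
import json

setup_msg = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Aoede"}}
            },
            "temperature": 0.7,
        },
        "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    }
}
ws.send(json.dumps(setup_msg))

# No audio may flow before this acknowledgement arrives.
while True:
    msg = json.loads(ws.recv())
    if "setupComplete" in msg:
        break
```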
The Anatomy of a Gemini Conversation
The flow is remarkably similar to OpenAI, but with a few protocol-specific twists:
```mermaid
sequenceDiagram
    participant Caller as SIP Caller
    participant Baresip as Baresip (openai_rt)
    participant API as Sipfront API
    participant Gemini as Google Gemini Live

    rect rgb(240, 240, 240)
    Note over Baresip, Gemini: 1. Auth Token Creation
    Baresip->>API: Request Session Token
    API->>Gemini: POST /v1alpha/auth_tokens
    Gemini-->>API: { "name": "auth_tokens/..." }
    API-->>Baresip: Ephemeral Token
    end

    Note over Baresip, Gemini: 2. WebSocket Setup
    Baresip->>Gemini: WSS Connect (Authorization: Token auth_tokens/...)
    Baresip->>Gemini: setup (Model, Voice, Instructions)
    Gemini-->>Baresip: setupComplete

    Note over Caller, Gemini: 3. Active Call & Audio Flow
    Caller->>Baresip: RTP Audio (G.711/Opus)
    Baresip->>Baresip: Resample to 16kHz PCM
    Baresip->>Gemini: realtime_input (Base64)

    Note over Gemini: Interruption Detection
    Gemini->>Baresip: serverContent { "interrupted": true }
    Baresip->>Baresip: Clear Injection Buffer
    Gemini->>Baresip: serverContent (modelTurn with Audio)
    Baresip->>Baresip: Decode & Buffer PCM (24kHz)
    Baresip->>Caller: RTP Audio
```
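In code, the realtime_input step from the diagram boils down to base64-encoding 16-bit little-endian mono PCM at 16kHz and wrapping it in a JSON frame. A sketch follows, with the caveat that the exact envelope is an assumption: the field names have shifted between API revisions.

```python
# Sketch: one uplink audio frame. 16-bit little-endian mono PCM at
# 16 kHz, base64-encoded. The 'mediaChunks' envelope is an assumption
# (newer API revisions nest the payload under an 'audio' object).
import base64
import json

def send_audio_chunk(ws, pcm16k: bytes) -> None:
    frame = {
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm16k).decode("ascii"),
            }]
        }
    }
    ws.send(json.dumps(frame))
```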
Key Technical Differences
While the high-level flow is similar, the “devil is in the details”:
- Audio Sampling Rates:
  - OpenAI: Expects 24kHz input and provides 24kHz output.
  - Gemini: Expects 16kHz input and provides 24kHz output. This requires our baresip module to handle asymmetric resampling (see the sketch after this list).
- Interruption Handling:
  - OpenAI: Sends a dedicated `input_audio_buffer.speech_started` event.
  - Gemini: Includes an `interrupted: true` flag within the `serverContent` message.
- Tool Calling:
  - OpenAI: Uses a flat `tools` array in the session update.
  - Gemini: Wraps tools in a `function_declarations` array inside a `tools` object.
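Here is what that asymmetric resampling looks like as a Python sketch. Our module does this in C; `audioop` (removed from the standard library in Python 3.13, available via the `audioop-lts` backport) merely stands in for a proper resampler.

```python
# Sketch: asymmetric resampling between the SIP leg and Gemini.
# audioop.ratecv() returns (converted_fragment, new_state); the state
# must be threaded through successive calls for continuous audio.
import audioop

def uplink_8k_to_16k(pcm8k: bytes, state=None):
    """Caller -> Gemini: upsample 8 kHz narrowband PCM to 16 kHz."""
    return audioop.ratecv(pcm8k, 2, 1, 8000, 16000, state)

def downlink_24k_to_8k(pcm24k: bytes, state=None):
    """Gemini -> Caller: downsample 24 kHz model audio to 8 kHz."""
    return audioop.ratecv(pcm24k, 2, 1, 24000, 8000, state)
```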
Tool-Calling in Gemini: Giving the Bot Hands
A voice bot becomes an agent when it can interact with the real world. In our implementation, we define tools that allow Gemini to control the SIP call or interact with external APIs.
1. Defining the Tools
Tools are declared during the initial setup message. Each tool is defined within a function_declarations array. Here is how we define our three core tools for Gemini:
```json
{
  "setup": {
    "tools": [
      {
        "function_declarations": [
          {
            "name": "hangup_call",
            "description": "Ends the current SIP call immediately."
          },
          {
            "name": "send_dtmf",
            "description": "Sends DTMF tones (digits) to the caller.",
            "parameters": {
              "type": "object",
              "properties": {
                "digits": {
                  "type": "string",
                  "description": "The sequence of digits to send (0-9, *, #)."
                }
              },
              "required": ["digits"]
            }
          },
          {
            "name": "api_call",
            "description": "Performs an external HTTP API request.",
            "parameters": {
              "type": "object",
              "properties": {
                "method": { "type": "string", "enum": ["GET", "POST"] },
                "uri": { "type": "string" },
                "body": { "type": "string" }
              },
              "required": ["method", "uri"]
            }
          }
        ]
      }
    ]
  }
}
```
2. How Gemini Calls a Tool
When the model determines a tool is needed, it sends a toolCall message over the WebSocket. Unlike OpenAI’s flat event structure, Gemini groups these into a functionCalls array:
```json
{
  "toolCall": {
    "functionCalls": [
      {
        "id": "call_gemini_123",
        "name": "send_dtmf",
        "args": {
          "digits": "1234"
        }
      }
    ]
  }
}
```
3. Responding to the Tool Call
After our C module executes the action (e.g., triggering the DTMF tones in baresip), we must provide a tool_response back to Gemini. This allows the model to acknowledge the result and continue the conversation.
```json
{
  "tool_response": {
    "functionResponses": [
      {
        "id": "call_gemini_123",
        "response": {
          "result": "Successfully sent DTMF digits: 1234"
        }
      }
    ]
  }
}
```
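Putting the two messages together, a minimal dispatch loop might look like the Python sketch below. The handler bodies are hypothetical stand-ins for what our C module does natively inside baresip.

```python
# Sketch: dispatching a toolCall and answering with functionResponses.
# The handler bodies are placeholders; the real DTMF/hangup actions
# happen in our C module inside baresip.
import json

def handle_tool_call(ws, msg: dict) -> None:
    responses = []
    for call in msg["toolCall"]["functionCalls"]:
        if call["name"] == "send_dtmf":
            digits = call["args"]["digits"]
            # ... inject DTMF into the RTP stream here ...
            result = f"Successfully sent DTMF digits: {digits}"
        elif call["name"] == "hangup_call":
            # ... tear down the SIP dialog here ...
            result = "Call terminated"
        else:
            result = f"Unknown tool: {call['name']}"
        responses.append({"id": call["id"], "response": {"result": result}})
    ws.send(json.dumps({"tool_response": {"functionResponses": responses}}))
```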
VAD and Turn Detection: Speed vs. Context
Gemini Live provides a sophisticated approach to turn detection, primarily through its Automatic Activity Detection (AAD). While OpenAI explicitly separates “Server VAD” and “Semantic VAD”, Gemini integrates these concepts into a unified, highly tunable mechanism.
1. Automatic Activity Detection (AAD)
Gemini’s default mode is essentially a high-performance Server VAD. It uses a dedicated model to detect speech patterns and silence, allowing the bot to respond with minimal latency. We tune this in the realtimeInputConfig:
- `startOfSpeechSensitivity`: Controls how easily the model triggers on new audio.
- `silenceDurationMs`: The "snappiness" of the response. We often set this to 400-600ms to balance natural pauses with quick turn-taking.
- `prefixPaddingMs`: Ensures the very beginning of a sentence isn't clipped.
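Merged into the setup message from the earlier sketch, that tuning looks roughly like this. The sensitivity enum value is our assumption about the current API surface, and 500ms sits in the 400-600ms range mentioned above.

```python
# Sketch: AAD tuning merged into the setup_msg from the earlier sketch.
# The enum value is an assumption about the current v1alpha surface.
setup_msg["setup"]["realtimeInputConfig"] = {
    "automaticActivityDetection": {
        "startOfSpeechSensitivity": "START_SENSITIVITY_HIGH",
        "prefixPaddingMs": 100,    # keep the first syllable intact
        "silenceDurationMs": 500,  # snappiness vs. natural pauses
    }
}
```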
2. The Semantic Layer
While Gemini doesn’t have a separate “Semantic VAD” toggle, its underlying architecture is inherently language-aware. The model uses the context of the conversation to help decide if a silence is a “thinking pause” or a “turn completion.”
This behavior is influenced by the generationConfig, specifically how we prompt the model. By adjusting the system instructions, we can make the bot more patient or more eager to jump in, effectively achieving a similar result to OpenAI’s semantic tuning.
Why Sipfront?
By implementing both OpenAI and Gemini at the code level, Sipfront provides an unparalleled perspective on Voice AI performance. We can tell you if your bot’s perceived “slowness” is due to the model’s processing time, the asymmetric audio resampling, or the specific VAD tuning of the platform.
Understanding these nuances is critical for building reliable, human-like voice agents. Whether you are using OpenAI or Gemini, Sipfront helps you understand exactly what is happening under the hood.
If you are navigating the choice between these platforms or optimizing your existing bot, contact us to see how our deep technical expertise can accelerate your journey.