Following our exploration of OpenAI Realtime, we now turn our attention to Google Gemini Live. While both platforms aim to provide low-latency, conversational AI, the underlying “nervous systems” have distinct characteristics.
At Sipfront, our deep dive into the code allows us to compare these giants not just on their features, but on their raw technical behavior.
The Initialization: Ephemeral Auth Tokens
Similar to OpenAI, Gemini Live uses a short-lived token for secure client-side authentication. However, the “burn-in” process is slightly different.
1. The Auth Token Request
Our backend API requests an ephemeral token from Google. Unlike OpenAI, where we define the session parameters during token creation, Gemini’s token request is more focused on the lifecycle of the token itself.
Request:
```json
{
  "expireTime": "2026-01-29T11:00:00.000000+00:00",
  "newSessionExpireTime": "2026-01-29T10:01:00.000000+00:00",
  "uses": 1
}
```
Response:
```json
{
  "name": "auth_tokens/abc123...",
  "expireTime": "2026-01-29T11:00:00Z"
}
```
We use this `name` value as the token to authenticate the WebSocket. A key difference here is the header: while OpenAI uses `Authorization: Bearer`, Gemini uses `Authorization: Token` (for ephemeral tokens) or `x-goog-api-key` (for regular API keys).
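To make this concrete, here is a minimal Python sketch of both steps. Our production code does this in C inside baresip; the REST path and WebSocket URL below are assumptions reconstructed from the flow in this post, so verify them against Google's current documentation.

```python
# Sketch: minting an ephemeral token (backend) and opening the Live
# WebSocket (client). Paths/URLs are assumptions -- check them against
# Google's current docs. Our production code does this in C (baresip).
import requests
from websocket import create_connection  # pip install websocket-client

API_KEY = "YOUR_GOOGLE_API_KEY"  # long-lived key, backend only

# 1. Backend: exchange the API key for a single-use ephemeral token.
resp = requests.post(
    "https://generativelanguage.googleapis.com/v1alpha/auth_tokens",
    headers={"x-goog-api-key": API_KEY},
    json={
        "expireTime": "2026-01-29T11:00:00.000000+00:00",
        "newSessionExpireTime": "2026-01-29T10:01:00.000000+00:00",
        "uses": 1,
    },
    timeout=10,
)
resp.raise_for_status()
token = resp.json()["name"]  # "auth_tokens/abc123..."

# 2. Client: open the WebSocket with the ephemeral token header.
WS_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
)
ws = create_connection(WS_URL, header={"Authorization": f"Token {token}"})
```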
2. The Setup Message (The Capabilities)
Once the WebSocket is connected, we send a setup message. This is where Gemini's session truly comes alive: we define the model, the voice, and the system instructions.
```json
{
  "setup": {
    "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
    "generationConfig": {
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": { "voiceName": "Aoede" }
        }
      },
      "temperature": 0.7
    },
    "systemInstruction": {
      "parts": [{ "text": "You are a helpful assistant." }]
    }
  }
}
```
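Continuing the Python sketch from above (`ws` is the connected WebSocket), the handshake itself is a single send followed by reading until the acknowledgement arrives:

```python
# Sketch, continuing from the connection above: send the setup message
# and block until the server acknowledges with setupComplete.
import json

setup_msg = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": "Aoede"}}
            },
            "temperature": 0.7,
        },
        "systemInstruction": {"parts": [{"text": "You are a helpful assistant."}]},
    }
}
ws.send(json.dumps(setup_msg))

# No audio may flow before this acknowledgement arrives.
while True:
    msg = json.loads(ws.recv())
    if "setupComplete" in msg:
        break
```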
The Anatomy of a Gemini Conversation
The flow is remarkably similar to OpenAI, but with a few protocol-specific twists:
```mermaid
sequenceDiagram
    participant Caller as SIP Caller
    participant Baresip as Baresip (openai_rt)
    participant API as Sipfront API
    participant Gemini as Google Gemini Live

    rect rgb(240, 240, 240)
    Note over Baresip, Gemini: 1. Auth Token Creation
    Baresip->>API: Request Session Token
    API->>Gemini: POST /v1alpha/auth_tokens
    Gemini-->>API: { "name": "auth_tokens/..." }
    API-->>Baresip: Ephemeral Token
    end

    Note over Baresip, Gemini: 2. WebSocket Setup
    Baresip->>Gemini: WSS Connect (Authorization: Token auth_tokens/...)
    Baresip->>Gemini: setup (Model, Voice, Instructions)
    Gemini-->>Baresip: setupComplete

    Note over Caller, Gemini: 3. Active Call & Audio Flow
    Caller->>Baresip: RTP Audio (G.711/Opus)
    Baresip->>Baresip: Resample to 16kHz PCM
    Baresip->>Gemini: realtime_input (Base64)

    Note over Gemini: Interruption Detection
    Gemini->>Baresip: serverContent { "interrupted": true }
    Baresip->>Baresip: Clear Injection Buffer
    Gemini->>Baresip: serverContent (modelTurn with Audio)
    Baresip->>Baresip: Decode & Buffer PCM (24kHz)
    Baresip->>Caller: RTP Audio
```
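In code, the realtime_input step from the diagram boils down to base64-encoding 16-bit little-endian mono PCM at 16kHz and wrapping it in a JSON frame. A sketch follows, with the caveat that the exact envelope is an assumption: the field names have shifted between API revisions.

```python
# Sketch: one uplink audio frame. 16-bit little-endian mono PCM at
# 16 kHz, base64-encoded. The 'mediaChunks' envelope is an assumption
# (newer API revisions nest the payload under an 'audio' object).
import base64
import json

def send_audio_chunk(ws, pcm16k: bytes) -> None:
    frame = {
        "realtimeInput": {
            "mediaChunks": [{
                "mimeType": "audio/pcm;rate=16000",
                "data": base64.b64encode(pcm16k).decode("ascii"),
            }]
        }
    }
    ws.send(json.dumps(frame))
```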
Key Technical Differences
While the high-level flow is similar, the “devil is in the details”:
- Audio Sampling Rates:
  - OpenAI: Expects 24kHz input and provides 24kHz output.
  - Gemini: Expects 16kHz input and provides 24kHz output. This requires our baresip module to handle asymmetric resampling (see the sketch after this list).
- Interruption Handling:
  - OpenAI: Sends a dedicated `input_audio_buffer.speech_started` event.
  - Gemini: Includes an `interrupted: true` flag within the `serverContent` message.
- Tool Calling:
  - OpenAI: Uses a flat `tools` array in the session update.
  - Gemini: Wraps tools in a `function_declarations` array inside a `tools` object.
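Here is what that asymmetric resampling looks like as a Python sketch. Our module does this in C; `audioop` (removed from the standard library in Python 3.13, available via the `audioop-lts` backport) merely stands in for a proper resampler.

```python
# Sketch: asymmetric resampling between the SIP leg and Gemini.
# audioop.ratecv() returns (converted_fragment, new_state); the state
# must be threaded through successive calls for continuous audio.
import audioop

def uplink_8k_to_16k(pcm8k: bytes, state=None):
    """Caller -> Gemini: upsample 8 kHz narrowband PCM to 16 kHz."""
    return audioop.ratecv(pcm8k, 2, 1, 8000, 16000, state)

def downlink_24k_to_8k(pcm24k: bytes, state=None):
    """Gemini -> Caller: downsample 24 kHz model audio to 8 kHz."""
    return audioop.ratecv(pcm24k, 2, 1, 24000, 8000, state)
```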
Tool-Calling in Gemini: Giving the Bot Hands
A voice bot becomes an agent when it can interact with the real world. In our implementation, we define tools that allow Gemini to control the SIP call or interact with external APIs.
1. Defining the Tools
Tools are declared during the initial setup message. Each tool is defined within a function_declarations array. Here is how we define our three core tools for Gemini:
```json
{
  "setup": {
    "tools": [
      {
        "function_declarations": [
          {
            "name": "hangup_call",
            "description": "Ends the current SIP call immediately."
          },
          {
            "name": "send_dtmf",
            "description": "Sends DTMF tones (digits) to the caller.",
            "parameters": {
              "type": "object",
              "properties": {
                "digits": {
                  "type": "string",
                  "description": "The sequence of digits to send (0-9, *, #)."
                }
              },
              "required": ["digits"]
            }
          },
          {
            "name": "api_call",
            "description": "Performs an external HTTP API request.",
            "parameters": {
              "type": "object",
              "properties": {
                "method": { "type": "string", "enum": ["GET", "POST"] },
                "uri": { "type": "string" },
                "body": { "type": "string" }
              },
              "required": ["method", "uri"]
            }
          }
        ]
      }
    ]
  }
}
```
2. How Gemini Calls a Tool
When the model determines a tool is needed, it sends a toolCall message over the WebSocket. Unlike OpenAI’s flat event structure, Gemini groups these into a functionCalls array:
```json
{
  "toolCall": {
    "functionCalls": [
      {
        "id": "call_gemini_123",
        "name": "send_dtmf",
        "args": {
          "digits": "1234"
        }
      }
    ]
  }
}
```
3. Responding to the Tool Call
After our C module executes the action (e.g., triggering the DTMF tones in baresip), we must provide a tool_response back to Gemini. This allows the model to acknowledge the result and continue the conversation.
```json
{
  "tool_response": {
    "functionResponses": [
      {
        "id": "call_gemini_123",
        "response": {
          "result": "Successfully sent DTMF digits: 1234"
        }
      }
    ]
  }
}
```
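Putting the two messages together, a minimal dispatch loop might look like the Python sketch below. The handler bodies are hypothetical stand-ins for what our C module does natively inside baresip.

```python
# Sketch: dispatching a toolCall and answering with functionResponses.
# The handler bodies are placeholders; the real DTMF/hangup actions
# happen in our C module inside baresip.
import json

def handle_tool_call(ws, msg: dict) -> None:
    responses = []
    for call in msg["toolCall"]["functionCalls"]:
        if call["name"] == "send_dtmf":
            digits = call["args"]["digits"]
            # ... inject DTMF into the RTP stream here ...
            result = f"Successfully sent DTMF digits: {digits}"
        elif call["name"] == "hangup_call":
            # ... tear down the SIP dialog here ...
            result = "Call terminated"
        else:
            result = f"Unknown tool: {call['name']}"
        responses.append({"id": call["id"], "response": {"result": result}})
    ws.send(json.dumps({"tool_response": {"functionResponses": responses}}))
```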
VAD and Turn Detection: Speed vs. Context
Gemini Live provides a sophisticated approach to turn detection, primarily through its Automatic Activity Detection (AAD). While OpenAI explicitly separates “Server VAD” and “Semantic VAD”, Gemini integrates these concepts into a unified, highly tunable mechanism.
1. Automatic Activity Detection (AAD)
Gemini’s default mode is essentially a high-performance Server VAD. It uses a dedicated model to detect speech patterns and silence, allowing the bot to respond with minimal latency. We tune this in the realtimeInputConfig:
- `startOfSpeechSensitivity`: Controls how easily the model triggers on new audio.
- `silenceDurationMs`: The "snappiness" of the response. We often set this to 400-600ms to balance natural pauses with quick turn-taking.
- `prefixPaddingMs`: Ensures the very beginning of a sentence isn't clipped.
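Merged into the setup message from the earlier sketch, that tuning looks roughly like this. The sensitivity enum value is our assumption about the current API surface, and 500ms sits in the 400-600ms range mentioned above.

```python
# Sketch: AAD tuning merged into the setup_msg from the earlier sketch.
# The enum value is an assumption about the current v1alpha surface.
setup_msg["setup"]["realtimeInputConfig"] = {
    "automaticActivityDetection": {
        "startOfSpeechSensitivity": "START_SENSITIVITY_HIGH",
        "prefixPaddingMs": 100,    # keep the first syllable intact
        "silenceDurationMs": 500,  # snappiness vs. natural pauses
    }
}
```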
2. The Semantic Layer
While Gemini doesn’t have a separate “Semantic VAD” toggle, its underlying architecture is inherently language-aware. The model uses the context of the conversation to help decide if a silence is a “thinking pause” or a “turn completion.”
This behavior is influenced by the generationConfig, specifically how we prompt the model. By adjusting the system instructions, we can make the bot more patient or more eager to jump in, effectively achieving a similar result to OpenAI’s semantic tuning.
Why Sipfront?
By implementing both OpenAI and Gemini at the code level, Sipfront provides an unparalleled perspective on Voice AI performance. We can tell you if your bot’s perceived “slowness” is due to the model’s processing time, the asymmetric audio resampling, or the specific VAD tuning of the platform.
Understanding these nuances is critical for building reliable, human-like voice agents. Whether you are using OpenAI or Gemini, Sipfront helps you understand exactly what is happening under the hood.
If you are navigating the choice between these platforms or optimizing your existing bot, contact us to see how our deep technical expertise can accelerate your journey.