Bring your own audio! The Audio to Video WebSocket API is a stateful, event-driven, server-to-server endpoint designed to receive audio and stream realtime AI avatar video directly to a WebRTC provider. It is designed to receive faster-than-realtime speech from sources like:
- conversational frameworks: LiveKit Agents, Pipecat, Agora, etc.
- text-to-speech providers: ElevenLabs, Cartesia, etc.
- speech-to-speech providers: OpenAI Realtime, Gemini Live, etc.
It is well‑suited for:
- Running your own backend voice orchestration stack (LiveKit Agents, Pipecat, OpenAI Realtime, etc.) and using the resulting audio to drive the avatar
- Running customized workflows where you need complete control over the speech stack and timing logic
It is not designed for:
- Receiving audio from end user devices like browsers, apps, etc.
Note: This endpoint does NOT return video via WebSocket by design. We integrate directly with WebRTC networks like LiveKit for the lowest latency and most reliable experience.
Beta
As a beta endpoint, our goal is to provide an early experience of our product so we can receive feedback and iterate quickly. Additionally, while we will do our best to maintain backwards compatibility and data contract stability, it is not guaranteed.
Reference Flow
The following diagram illustrates a common architecture and flow with the following components:
- Client - typically a browser or mobile app running your frontend code with a WebRTC client SDK (e.g., LiveKit SDK, Pipecat RTVI, Daily SDK, Agora SDK)
- Backend - your API backend, usually with a WebRTC server SDK (e.g., LiveKit SDK, Daily SDK, Agora SDK)
- Agent Worker - your backend worker orchestrating the speech-to-speech flow (e.g., LiveKit Agents, Pipecat, TEN Framework)
- HeyGen API - HeyGen API and service
- WebRTC - a WebRTC video provider like LiveKit, Daily, Agora, etc.
- ASR - an automatic speech recognition provider like Deepgram, Gladia, etc.
- LLM - a large language model provider like OpenAI, Gemini, etc.
- TTS - a text-to-speech provider like ElevenLabs, Cartesia, etc.

Endpoint
The WebSocket address is returned in the realtime_endpoint field of the /v1/streaming.new API response:
wss://webrtc-signaling.heygen.io/v2-alpha/interactive-avatar/session/<session_id>
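As a minimal Python sketch of connecting to the endpoint (the `websockets` package and the helper names here are illustrative assumptions; in practice, use the exact `realtime_endpoint` value from the /v1/streaming.new response):

```python
SIGNALING_BASE = "wss://webrtc-signaling.heygen.io/v2-alpha/interactive-avatar/session"

def session_ws_url(session_id: str) -> str:
    """Build the session WebSocket URL for illustration.

    Prefer the exact `realtime_endpoint` value returned by /v1/streaming.new
    rather than constructing the URL yourself.
    """
    return f"{SIGNALING_BASE}/{session_id}"

async def connect_session(realtime_endpoint: str) -> None:
    # Third-party dependency (pip install websockets); imported here so the
    # helper above stays usable without it.
    import websockets

    # Open the stateful, server-to-server WebSocket connection and consume
    # server events as they arrive.
    async with websockets.connect(realtime_endpoint) as ws:
        async for raw in ws:
            print("server event:", raw)  # dispatch to your event handlers here
```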
Client Actions
agent.speak
Stream chunks of avatar audio. Audio must be 16-bit PCM at 24 kHz, encoded as Base64.
{
  "type": "agent.speak",
  "event_id": "<event_id>",
  "audio": "<base64-encoded 16-bit 24 kHz PCM audio segment>"
}
agent.speak_end
Signals the end of avatar audio. A final audio chunk may be included; otherwise leave the audio field blank.
{
  "type": "agent.speak_end",
  "event_id": "<event_id>",
  "audio": "<optional base64-encoded 16-bit 24 kHz PCM audio segment>"
}
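As a sketch of the two actions above in Python: split faster-than-realtime PCM into chunks, send each as agent.speak, then close the turn with agent.speak_end. The chunk size and the `send` callable are illustrative assumptions.

```python
import base64
import json
import uuid

# ~100 ms of 16-bit mono PCM at 24 kHz (chunk size is an assumption).
CHUNK_BYTES = 24000 * 2 // 10

def speak_message(pcm_chunk: bytes) -> str:
    """Wrap one 16-bit / 24 kHz PCM chunk in an agent.speak frame."""
    return json.dumps({
        "type": "agent.speak",
        "event_id": str(uuid.uuid4()),
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def speak_end_message(final_chunk: bytes = b"") -> str:
    """Signal end of audio; optionally carry a final chunk."""
    msg = {"type": "agent.speak_end", "event_id": str(uuid.uuid4())}
    if final_chunk:
        msg["audio"] = base64.b64encode(final_chunk).decode("ascii")
    return json.dumps(msg)

def stream_pcm(pcm: bytes, send) -> None:
    """Chunk a PCM buffer, send each chunk, then end the turn."""
    for i in range(0, len(pcm), CHUNK_BYTES):
        send(speak_message(pcm[i : i + CHUNK_BYTES]))
    send(speak_end_message())
```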
agent.audio_buffer_clear
Discard any audio you’ve buffered.
{
  "type": "agent.audio_buffer_clear",
  "event_id": "<event_id>"
}
agent.interrupt
Stop any current and queued avatar tasks. This is usually followed by a new agent.speak.
{
  "type": "agent.interrupt",
  "event_id": "<event_id>"
}
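For example, a user barge-in is typically handled by dropping any uncommitted buffered audio and interrupting playback, after which the next agent.speak starts the avatar's new response. A sketch, where `send` and the helper names are assumptions:

```python
import json
import uuid

def buffer_clear_message() -> str:
    # Discard audio buffered but not yet committed.
    return json.dumps({"type": "agent.audio_buffer_clear",
                       "event_id": str(uuid.uuid4())})

def interrupt_message() -> str:
    # Stop current and queued avatar tasks.
    return json.dumps({"type": "agent.interrupt",
                       "event_id": str(uuid.uuid4())})

def handle_barge_in(send) -> None:
    """Typical barge-in: clear the buffer, interrupt playback."""
    send(buffer_clear_message())
    send(interrupt_message())
```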
agent.start_listening
Triggers the avatar's listening animation. This will only succeed if the avatar is currently idle.
{
  "type": "agent.start_listening",
  "event_id": "<event_id>"
}
agent.stop_listening
Stops the listening animation (only if the avatar is currently listening).
{
  "type": "agent.stop_listening",
  "event_id": "<event_id>"
}
session.keep_alive
Resets the activity idle timeout set in the New Session API. Use this to keep the session alive during periods of inactivity that exceed the activity idle timeout.
{
  "type": "session.keep_alive",
  "event_id": "<event_id>"
}
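A keep-alive can be sent on a timer while the user is inactive. This asyncio sketch assumes a `send` coroutine and an interval comfortably below your configured idle timeout:

```python
import asyncio
import json
import uuid

def keep_alive_message() -> str:
    """Build a session.keep_alive frame to reset the idle timeout."""
    return json.dumps({"type": "session.keep_alive",
                       "event_id": str(uuid.uuid4())})

async def keep_alive_loop(send, interval_s: float = 30.0) -> None:
    """Periodically send keep-alives; cancel this task when the session ends."""
    while True:
        await asyncio.sleep(interval_s)
        await send(keep_alive_message())
```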
Server Events
session.state_updated
The server reports that the session state changed.
{
  "type": "session.state_updated",
  "event_id": "<uuid>",
  "state": "initialized | connecting | connected | disconnected"
}
- initialized - session is starting up
- connecting - session is waiting for the participant to join
- connected - session and participant are ready
- disconnected - session is ending
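In practice you typically gate audio streaming on the connected state. A sketch of parsing this event and tracking the latest state (class and property names are illustrative):

```python
import json

class SessionTracker:
    """Track the latest session state reported by the server."""

    def __init__(self) -> None:
        self.state = None

    def handle_event(self, raw: str) -> None:
        # Update state only on session.state_updated; ignore other events.
        event = json.loads(raw)
        if event.get("type") == "session.state_updated":
            self.state = event["state"]

    @property
    def ready(self) -> bool:
        # Only stream audio once the participant has joined.
        return self.state == "connected"
```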
agent.audio_buffer_appended
Audio chunk accepted and buffered for the current task.
{
  "type": "agent.audio_buffer_appended",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}
agent.audio_buffer_committed
Buffered audio finalized and queued for playback for this task.
{
  "type": "agent.audio_buffer_committed",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}
agent.audio_buffer_cleared
Buffered audio discarded/reset; does not trigger playback.
{
  "type": "agent.audio_buffer_cleared",
  "event_id": "<uuid>"
}
agent.idle_started
Avatar entered the idle state.
{
  "type": "agent.idle_started",
  "event_id": "<uuid>"
}
agent.idle_ended
Avatar left the idle state.
{
  "type": "agent.idle_ended",
  "event_id": "<uuid>"
}
agent.speak_started
Avatar began speaking the given task.
{
  "type": "agent.speak_started",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}
agent.speak_ended
Avatar finished speaking the given task.
{
  "type": "agent.speak_ended",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}
agent.speak_interrupted
Avatar speech stopped early due to an interrupt.
{
  "type": "agent.speak_interrupted",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}
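The per-task events above can be correlated by task.id to follow each utterance through its lifecycle. A minimal sketch; the state names are assumptions, not part of the API:

```python
import json

# Map server event types to the lifecycle state they imply for a task.
TASK_STATES = {
    "agent.audio_buffer_appended": "buffering",
    "agent.audio_buffer_committed": "queued",
    "agent.speak_started": "speaking",
    "agent.speak_ended": "done",
    "agent.speak_interrupted": "interrupted",
}

def track_tasks(raw_events):
    """Return the last known lifecycle state for each task id seen."""
    tasks = {}
    for raw in raw_events:
        event = json.loads(raw)
        state = TASK_STATES.get(event.get("type"))
        if state and "task" in event:
            tasks[event["task"]["id"]] = state
    return tasks
```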
error
A request failed; includes the error type/message and the client event_id it refers to.
{
  "type": "error",
  "event_id": "<uuid>",
  "error": {
    "type": "invalid_request_error | server_error",
    "message": "<string>",
    "event_id": "<client_event_id>"
  }
}
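Because the error payload echoes the originating client event_id, failures can be correlated with in-flight requests. A sketch, where `pending` is an illustrative map of outstanding client event ids:

```python
import json

def handle_error(raw: str, pending: dict):
    """If `raw` is an error event, drop the failed request from `pending`
    and return a formatted message; otherwise return None."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    err = event["error"]
    # err["event_id"] echoes the client event_id that triggered the failure.
    pending.pop(err.get("event_id"), None)
    return f"{err['type']}: {err['message']}"
```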
warning
Non-fatal notice (e.g., deprecation); includes a message and the related client event_id.
{
  "type": "warning",
  "event_id": "<uuid>",
  "warning": {
    "type": "deprecation_warning",
    "message": "<string>",
    "event_id": "<client_event_id>"
  }
}
Feedback & Improvements
This API is under continuous development as we improve integration and performance. If you have any feedback, please share it with us!