Bring your own audio! The Audio to Video WebSocket API is a stateful, event-driven, server-to-server endpoint designed to receive audio and stream realtime AI avatar video directly to a WebRTC provider. It is designed to receive faster-than-realtime speech from sources such as:
- conversational frameworks: LiveKit Agents, Pipecat, Agora, etc.
- text-to-speech providers: ElevenLabs, Cartesia, etc.
- speech-to-speech providers: OpenAI Realtime, Gemini Live, etc.
It is well‑suited for:
- Running your own backend voice orchestration stack (LiveKit Agents, Pipecat, OpenAI Realtime, etc.) and using the resulting audio to drive the avatar
- Running customized workflows where you need complete control over the speech stack and timing logic
It is not designed for:
- Receiving audio from end user devices like browsers, apps, etc.
Note: This endpoint does NOT return video over the WebSocket, by design. We integrate directly with WebRTC networks like LiveKit for the lowest latency and most reliable experience.
Beta
This endpoint is in beta: our goal is to provide an early experience of the product so we can receive feedback and iterate quickly. While we will do our best to maintain backwards compatibility and data contract stability, it is not guaranteed.
Reference Flow
The following diagram illustrates a common architecture and flow with the following components:
- Client - typically a browser or mobile app running your frontend code with a WebRTC client SDK (e.g., LiveKit SDK, Pipecat RTVI, Daily SDK, Agora SDK)
- Backend - your API backend, usually with a WebRTC server SDK (e.g., LiveKit SDK, Daily SDK, Agora SDK)
- Agent Worker - your backend worker orchestrating the speech-to-speech flow (e.g., LiveKit Agents, Pipecat, TEN Framework)
- HeyGen API - the HeyGen API and service
- WebRTC - a WebRTC video provider like LiveKit, Daily, Agora, etc.
- ASR - an automatic speech recognition provider like Deepgram, Gladia, etc.
- LLM - a large language model provider like OpenAI, Gemini, etc.
- TTS - a text-to-speech provider like ElevenLabs, Cartesia, etc.
Endpoint
The WebSocket address can be found in the realtime_endpoint field in the response payload of the /v1/streaming.new API call.
wss://webrtc-signaling.heygen.io/v2-alpha/interactive-avatar/session/<session_id>
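As a sketch, connecting and sending a first message from Python might look like the following. It assumes the third-party `websockets` package, and that you have already read the realtime_endpoint value from the streaming.new response; the session id and event id are placeholders.

```python
import json

# Placeholder: substitute the realtime_endpoint value returned by
# the /v1/streaming.new call for your session.
REALTIME_ENDPOINT = (
    "wss://webrtc-signaling.heygen.io/v2-alpha/interactive-avatar/session/<session_id>"
)

def keep_alive_frame(event_id: str) -> str:
    """Serialize a session.keep_alive message as a JSON text frame."""
    return json.dumps({"type": "session.keep_alive", "event_id": event_id})

async def run() -> None:
    import websockets  # pip install websockets

    # All client actions are sent as JSON text frames over this socket.
    async with websockets.connect(REALTIME_ENDPOINT) as ws:
        await ws.send(keep_alive_frame("evt-1"))

# asyncio.run(run())  # requires a live session id
```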
Client Actions
agent.speak
Stream audio chunks of avatar audio. Audio should be 16-bit, 24 kHz PCM bytes encoded as Base64.
{
"type": "agent.speak",
"event_id": "<event_id>",
"audio": "<base64-encoded 16-bit 24 kHz PCM audio>"
}
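As a sketch, chunking raw PCM into agent.speak messages followed by agent.speak_end could look like this. The chunk size and event-id scheme are arbitrary choices for illustration, not part of the API.

```python
import base64
import itertools

_ids = itertools.count(1)

def speak_messages(pcm: bytes, chunk_size: int = 4800) -> list:
    """Split 16-bit, 24 kHz PCM bytes into agent.speak payloads.

    4800 bytes = 100 ms of mono 16-bit audio at 24 kHz (2 bytes/sample).
    """
    msgs = []
    for off in range(0, len(pcm), chunk_size):
        msgs.append({
            "type": "agent.speak",
            "event_id": str(next(_ids)),
            "audio": base64.b64encode(pcm[off:off + chunk_size]).decode("ascii"),
        })
    # Signal the end of this utterance; the final audio field is left empty here.
    msgs.append({"type": "agent.speak_end", "event_id": str(next(_ids)), "audio": ""})
    return msgs
```

Each message would then be serialized with json.dumps and sent as a text frame over the socket.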
agent.speak_end
Signal to the avatar the end of avatar audio. A final audio chunk can be included; otherwise leave it blank.
{
"type": "agent.speak_end",
"event_id": "<event_id>",
"audio": "<base64-encoded 16-bit 24 kHz PCM audio>"
}

agent.audio_buffer_clear
Discard any audio you've buffered.
{
"type": "agent.audio_buffer_clear",
"event_id": "<event_id>"
}

agent.interrupt
Stop any current and queued avatar tasks. This is usually followed by an agent.speak.
{
"type": "agent.interrupt",
"event_id": "<event_id>"
}
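A common barge-in pattern is to send agent.interrupt and then immediately stream the new reply. The helper below only builds that message sequence; the event_id values are illustrative.

```python
import base64

def barge_in(new_pcm: bytes, event_id: str = "evt-1") -> list:
    """Build the messages for interrupting current speech and
    starting a new utterance (event ids are placeholders)."""
    return [
        # Stop current and queued tasks first...
        {"type": "agent.interrupt", "event_id": f"{event_id}-interrupt"},
        # ...then stream the replacement audio.
        {
            "type": "agent.speak",
            "event_id": f"{event_id}-speak",
            "audio": base64.b64encode(new_pcm).decode("ascii"),
        },
        {"type": "agent.speak_end", "event_id": f"{event_id}-end", "audio": ""},
    ]
```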
agent.start_listening
Triggers the avatar's listening animation. This will only succeed if the avatar is currently idle.
{
"type": "agent.start_listening",
"event_id": "<event_id>"
}

agent.stop_listening
Stops the listening animation (only if the avatar is currently listening).
{
"type": "agent.stop_listening",
"event_id": "<event_id>"
}

session.keep_alive
Resets the activity idle timeout set in the New Session API. Use this to keep the session alive during periods of inactivity that exceed the activity idle timeout.
{
"type": "session.keep_alive",
"event_id": "<event_id>"
}

Server Events
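Each server event arrives as a JSON text frame. A minimal dispatch sketch might look like this; the handler mapping and names are placeholders, not part of the API.

```python
import json

def dispatch(frame: str, handlers: dict) -> str:
    """Parse a server event frame and invoke a handler keyed by event type.

    Unknown event types fall through to a no-op, so new server
    events do not break the loop.
    """
    event = json.loads(frame)
    handler = handlers.get(event["type"], lambda e: None)
    handler(event)
    return event["type"]

# Usage: collect session state changes as they arrive.
states = []
handlers = {"session.state_updated": lambda e: states.append(e["state"])}
dispatch(
    '{"type": "session.state_updated", "event_id": "u1", "state": "connected"}',
    handlers,
)
```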
session.state_updated
The server reports that the session state has changed.
{
"type": "session.state_updated",
"event_id": "<uuid>",
"state": "initialized | connecting | connected | disconnected"
}
- initialized - session is starting up
- connecting - session is waiting for the participant to join
- connected - session and participant are ready
- disconnected - session has ended
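For example, a client can track these states and gate audio on readiness. The rule below (only stream audio once connected) is an assumption about a sensible integration, not a documented requirement.

```python
class SessionState:
    """Tracks session.state_updated events from the server."""

    def __init__(self):
        # Sessions begin in the initialized state.
        self.state = "initialized"

    def on_event(self, event: dict) -> None:
        """Update the tracked state from a parsed server event."""
        if event["type"] == "session.state_updated":
            self.state = event["state"]

    @property
    def ready(self) -> bool:
        """Assumed readiness rule: stream audio only once connected."""
        return self.state == "connected"
```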
agent.audio_buffer_appended
Audio chunk accepted and buffered for the current task.
{
"type": "agent.audio_buffer_appended",
"event_id": "<uuid>",
"task": { "id": "<task_id>" }
}

agent.audio_buffer_committed
Buffered audio finalized and queued for playback for this task.
{
"type": "agent.audio_buffer_committed",
"event_id": "<uuid>",
"task": { "id": "<task_id>" }
}

agent.audio_buffer_cleared
Buffered audio discarded/reset; does not trigger playback.
{
"type": "agent.audio_buffer_cleared",
"event_id": "<uuid>"
}

agent.idle_started
Avatar entered the idle state.
{
"type": "agent.idle_started",
"event_id": "<uuid>"
}

agent.idle_ended
Avatar left the idle state.
{
"type": "agent.idle_ended",
"event_id": "<uuid>"
}

agent.speak_started
Avatar began speaking the given task.
{
"type": "agent.speak_started",
"event_id": "<uuid>",
"task": { "id": "<task_id>" }
}

agent.speak_ended
Avatar finished speaking the given task.
{
"type": "agent.speak_ended",
"event_id": "<uuid>",
"task": { "id": "<task_id>" }
}

agent.speak_interrupted
Avatar speech stopped early due to an interrupt.
{
"type": "agent.speak_interrupted",
"event_id": "<uuid>",
"task": { "id": "<task_id>" }
}

error
A request failed; includes the error type/message and the client event_id it refers to.
{
"type": "error",
"event_id": "<uuid>",
"error": {
"type": "invalid_request_error | server_error",
"message": "<string>",
"event_id": "<client_event_id>"
}
}

warning
A non-fatal notice (e.g., a deprecation); includes a message and the related client event_id.
{
"type": "warning",
"event_id": "<uuid>",
"warning": {
"type": "deprecation_warning",
"message": "<string>",
"event_id": "<client_event_id>"
}
}

Feedback & Improvements
This API is under continuous development to improve integration and performance. If you have any feedback, please share it with us!
