WSS Audio to Video API (Beta)

Bring your own audio! The Audio to Video WebSocket API is a stateful, event-driven, server-to-server endpoint designed to receive audio and stream realtime AI avatar video directly to a WebRTC provider. It is designed to receive faster-than-realtime speech from sources like:

  • conversational frameworks: LiveKit Agents, Pipecat, Agora, etc.
  • text-to-speech providers: ElevenLabs, Cartesia, etc.
  • speech-to-speech providers: OpenAI Realtime, Gemini Live, etc.

It is well‑suited for:

  • Running your own backend voice orchestration stack (LiveKit Agents, Pipecat, OpenAI Realtime, etc.) and using the resulting audio to drive the avatar
  • Running customized workflows where you need complete control over the speech stack and timing logic

It is not designed for:

  • Receiving audio from end user devices like browsers, apps, etc.

Note: This endpoint does NOT return video over WebSocket by design. We integrate directly with WebRTC networks like LiveKit for the lowest latency and most reliable experience.

Beta

As a beta endpoint, our goal is to provide an early experience of our product so we can receive feedback and iterate quickly. While we will do our best to maintain backwards compatibility and data contract stability, it is not guaranteed.

Reference Flow

The following diagram illustrates a common architecture and flow with the following components:

  • Client - typically a browser or mobile app running your frontend code with a WebRTC client SDK (e.g. LiveKit SDK, Pipecat RTVI, Daily SDK, Agora SDK)
  • Backend - your API backend, usually with a WebRTC server SDK (e.g. LiveKit SDK, Daily SDK, Agora SDK)
  • Agent Worker - your backend worker orchestrating the speech-to-speech flow (e.g. LiveKit Agents, Pipecat, TEN Framework)
  • HeyGen API - the HeyGen API and service
  • WebRTC - a WebRTC video provider like LiveKit, Daily, Agora, etc.
  • ASR - an automatic speech recognition provider like Deepgram, Gladia, etc.
  • LLM - a large language model provider like OpenAI, Gemini, etc.
  • TTS - a text-to-speech provider like ElevenLabs, Cartesia, etc.

Endpoint

The WebSocket address can be found in the realtime_endpoint field in the response payload of the /v1/streaming.new API call.

wss://webrtc-signaling.heygen.io/v2-alpha/interactive-avatar/session/<session_id>
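A minimal sketch of pulling the signaling URL out of the /v1/streaming.new response before connecting. The `data` wrapper around the payload is an assumption about the response shape; adapt the lookup path to the actual body you receive:

```python
import json

def realtime_endpoint_from(body: str) -> str:
    """Extract the WSS signaling URL from a /v1/streaming.new response body.

    Assumes the field sits under a top-level "data" object -- this is a
    guess at the payload shape, not a documented contract.
    """
    payload = json.loads(body)
    return payload["data"]["realtime_endpoint"]

# You would then connect with any WebSocket client library, e.g.:
#   async with websockets.connect(realtime_endpoint_from(body)) as ws: ...
```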

Client Actions

agent.speak

Stream chunks of avatar audio. Audio must be 16-bit, 24 kHz PCM bytes encoded as Base64.

{
	"type": "agent.speak",
	"event_id": "<event_id>",
	"audio": "<base64-encoded 16-bit 24 kHz PCM audio chunk>"
}

agent.speak_end

Signal to the avatar the end of the avatar audio. A final audio chunk can be added; otherwise leave the audio field empty.

{
	"type": "agent.speak_end",
	"event_id": "<event_id>",
	"audio": "<base64-encoded 16-bit 24 kHz PCM audio chunk>"
}
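Taken together, a speak turn is a run of agent.speak chunks closed by one agent.speak_end. A sketch of turning a raw PCM buffer into that message sequence; the 100 ms chunk size and the event-ID scheme are illustrative choices, not prescribed by the API:

```python
import base64
import json

SAMPLE_RATE = 24_000      # 24 kHz, as required by the API
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 100            # illustrative chunk duration
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def speak_messages(pcm: bytes, event_id: str):
    """Yield agent.speak frames for each chunk, then a closing agent.speak_end."""
    for off in range(0, len(pcm), CHUNK_BYTES):
        yield json.dumps({
            "type": "agent.speak",
            "event_id": event_id,
            "audio": base64.b64encode(pcm[off:off + CHUNK_BYTES]).decode("ascii"),
        })
    # All audio already sent, so close the turn with an empty audio field.
    yield json.dumps({"type": "agent.speak_end", "event_id": event_id, "audio": ""})
```

Each frame is a ready-to-send text message, e.g. `for msg in speak_messages(pcm, "turn-1"): await ws.send(msg)`.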

agent.audio_buffer_clear

Discard any audio you’ve buffered.

{
  "type": "agent.audio_buffer_clear",
  "event_id": "<event_id>"
}

agent.interrupt

Stop any current and queued avatar tasks. This is usually followed by an agent.speak.

{
  "type": "agent.interrupt",
  "event_id": "<event_id>"
}
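A typical barge-in interrupts the avatar and clears any buffered audio before starting a fresh speak turn. A sketch of that message sequence, with an illustrative event-ID naming scheme:

```python
import json

def barge_in_messages(turn_id: str) -> list:
    """Frames to cut off current avatar speech before streaming a new reply.

    agent.interrupt stops current and queued tasks; agent.audio_buffer_clear
    drops any audio already buffered for the old turn.
    """
    return [
        json.dumps({"type": "agent.interrupt", "event_id": f"{turn_id}-interrupt"}),
        json.dumps({"type": "agent.audio_buffer_clear", "event_id": f"{turn_id}-clear"}),
    ]
```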

agent.start_listening

Triggers the avatar's listening animation. This will only succeed if the avatar is currently idle.

{
  "type": "agent.start_listening",
  "event_id": "<event_id>"
}

agent.stop_listening

Stops the listening animation (only if the avatar is currently listening).

{
  "type": "agent.stop_listening",
  "event_id": "<event_id>"
}

session.keep_alive

Resets the activity idle timeout set in the New Session API. Use this to keep the session alive during periods of inactivity that exceed the activity idle timeout.

{
	"type": "session.keep_alive",
	"event_id": "<event_id>"
}
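One way to keep a mostly-idle session open is a background task that sends session.keep_alive on a timer. A sketch, assuming an async send callable such as `ws.send` and an interval kept safely below your configured idle timeout:

```python
import asyncio
import itertools
import json

def keep_alive_message(n: int) -> str:
    """Build one session.keep_alive frame (event-ID scheme is illustrative)."""
    return json.dumps({"type": "session.keep_alive", "event_id": f"keep-alive-{n}"})

async def keep_alive_loop(send, interval_s: float = 30.0) -> None:
    """Send a keep-alive every interval_s seconds until the task is cancelled."""
    for n in itertools.count():
        await asyncio.sleep(interval_s)
        await send(keep_alive_message(n))
```

Run it alongside your main loop, e.g. `task = asyncio.create_task(keep_alive_loop(ws.send))`, and cancel the task when the session ends.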

Server Events

session.state_updated

Server reports the session state changed.

{
  "type": "session.state_updated",
  "event_id": "<uuid>",
  "state": "initialized | connecting | connected | disconnected"
}
  • initialized - session is starting up
  • connecting - session is waiting for the participant to join
  • connected - session and participant are ready
  • disconnected - session has ended

agent.audio_buffer_appended

Audio chunk accepted and buffered for the current task.

{
  "type": "agent.audio_buffer_appended",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}

agent.audio_buffer_committed

Buffered audio finalized and queued for playback for this task.

{
  "type": "agent.audio_buffer_committed",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}

agent.audio_buffer_cleared

Buffered audio discarded/reset; does not trigger playback.

{
  "type": "agent.audio_buffer_cleared",
  "event_id": "<uuid>"
}

agent.idle_started

Avatar entered the idle state.

{
  "type": "agent.idle_started",
  "event_id": "<uuid>"
}

agent.idle_ended

Avatar left the idle state.

{
  "type": "agent.idle_ended",
  "event_id": "<uuid>"
}

agent.speak_started

Avatar began speaking the given task.

{
  "type": "agent.speak_started",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}

agent.speak_ended

Avatar finished speaking the given task.

{
  "type": "agent.speak_ended",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}

agent.speak_interrupted

Avatar speech stopped early due to an interrupt.

{
  "type": "agent.speak_interrupted",
  "event_id": "<uuid>",
  "task": { "id": "<task_id>" }
}

error

A request failed; includes error type/message and the client event_id it refers to.

{
  "type": "error",
  "event_id": "<uuid>",
  "error": {
    "type": "invalid_request_error | server_error",
    "message": "<string>",
    "event_id": "<client_event_id>"
  }
}

warning

Non-fatal notice (e.g., deprecation); includes message and related client event_id.

{
  "type": "warning",
  "event_id": "<uuid>",
  "warning": {
    "type": "deprecation_warning",
    "message": "<string>",
    "event_id": "<client_event_id>"
  }
}
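Since every server frame carries a "type" field, a client can route incoming events through a simple handler table. A minimal sketch that ignores unknown types, so newly added server events don't break existing clients:

```python
import json

def dispatch(frame: str, handlers: dict) -> None:
    """Route one server frame to the handler registered for its "type" field.

    Frames with no registered handler are silently ignored.
    """
    event = json.loads(frame)
    handler = handlers.get(event.get("type"))
    if handler is not None:
        handler(event)

# Example handler table (handlers are illustrative):
# handlers = {
#     "agent.speak_ended": lambda e: print("done:", e["task"]["id"]),
#     "error": lambda e: print("error:", e["error"]["message"]),
# }
```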


Feedback & Improvements

We are continuously developing this API to improve integration and performance. If you have any feedback, please share it with us!