Avatar receives user_start/stop but never returns user_talking_message / reply – need help tracing missing STT step

We are building a web page that lets users speak to a HeyGen Interactive Avatar in real time (voice chat).
Architecture:

  1. Browser – @heygen/streaming-avatar (v2.0.13)
  2. Flask backend proxy – forwards every /v1/streaming.* call, adds heartbeat, auth, etc.
  3. HeyGen cloud

Calling avatar.speak({text, taskType: REPEAT}) manually works, the video stream plays, and the heartbeat is OK.

The problem

When the user talks:

  • Browser fires USER_START then USER_STOP (so VAD works)

  • No USER_TALKING_MESSAGE, USER_END_MESSAGE, AVATAR_* events ever come back

  • The “streaming.chat” WebSocket shows only:

    {"event_type":"user_start"}
    {"event_type":"user_stop"}

    There is no STT transcript and no error.

Attempted fixes in full app (still silent)

Tried | Result
Load protobufjs first (window.protobuf = …) | ✔ no SDK crash, still silent
Wait for audio WS OPEN (same logic as sandbox) | still silent
sttSettings:{sampleRate:16000} in createStartAvatar | still silent
useSilencePrompt:true in startVoiceChat | still silent
Logged outgoing frames – we do see [FRAME] 512 bytes while speaking | frames are leaving browser
Listened for stt_error, voice_error, error events – none received | no error from server
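
One gap in the wait-for-open step is that it never settles if the socket stays in CONNECTING, and it waits forever if the socket is already CLOSED. A minimal sketch of a stricter wait follows. `waitForOpen` is a hypothetical helper, not part of the HeyGen SDK, and the 5-second default timeout is an arbitrary assumption.

```javascript
// Hypothetical helper (not part of the HeyGen SDK): resolve once a
// WebSocket-like object is OPEN; reject on error, close, or timeout.
function waitForOpen(ws, timeoutMs = 5000) {
  // readyState values defined by the WebSocket standard
  const OPEN = 1, CLOSING = 2, CLOSED = 3;
  return new Promise((resolve, reject) => {
    if (ws.readyState === OPEN) return resolve();
    if (ws.readyState === CLOSING || ws.readyState === CLOSED) {
      return reject(new Error("socket already closing or closed"));
    }
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`socket not open after ${timeoutMs} ms`));
    }, timeoutMs);
    const onOpen = () => { cleanup(); resolve(); };
    const onFail = () => { cleanup(); reject(new Error("socket failed before opening")); };
    // Remove every listener and the timer so nothing fires twice.
    function cleanup() {
      clearTimeout(timer);
      ws.removeEventListener("open", onOpen);
      ws.removeEventListener("error", onFail);
      ws.removeEventListener("close", onFail);
    }
    ws.addEventListener("open", onOpen);
    ws.addEventListener("error", onFail);
    ws.addEventListener("close", onFail);
  });
}
```

If this rejects in the failing app but resolves in the sandbox, that would point at the audio socket rather than at STT itself.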

Relevant code snippet (current prod page)

await avatar.startVoiceChat({isInputAudioMuted:false});

/* wait for WS open */
const ws = avatar.voiceChat._audioWebSocket;
await new Promise((res,rej)=>{
  if (ws.readyState === WebSocket.OPEN) return res();
  ws.addEventListener("open",res,{once:true});
  ws.addEventListener("error",rej,{once:true});
});

/* log frames */
const oldSend = ws.send;
ws.send = d => { console.debug("[FRAME]", d.byteLength); return oldSend.call(ws,d); };

await avatar.startListening();

avatar.on("stt_error",  e=>console.error("STT_ERR",e.detail));
avatar.on("voice_error",e=>console.error("VOICE_ERR",e.detail));
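
Note that the snippet above attaches its error handlers only after startListening() has returned, so anything emitted earlier would go unseen. A rough sketch of a tracker that can be attached before startVoiceChat() and later reports which events never arrived is shown here. `trackEvents` and `EXPECTED_EVENTS` are hypothetical names, and the event strings are taken from this post rather than verified against the SDK.

```javascript
// Event names taken from this post; the exact strings the SDK emits
// are an assumption and may need adjusting.
const EXPECTED_EVENTS = [
  "user_start",
  "user_stop",
  "user_talking_message",
  "user_end_message",
];

// Hypothetical helper: subscribe to every expected event on an emitter
// that exposes on(name, handler), and remember which ones have fired.
function trackEvents(emitter, names = EXPECTED_EVENTS) {
  const seen = new Set();
  for (const name of names) {
    emitter.on(name, (payload) => {
      seen.add(name);
      console.debug(`[EV ${name}]`, payload);
    });
  }
  return {
    seen,
    // Names that were expected but have not been observed so far.
    missing: () => names.filter((name) => !seen.has(name)),
  };
}
```

Calling trackEvents(avatar) before startVoiceChat() and checking tracker.missing() after an utterance would confirm whether the transcript events are truly absent or were simply emitted before the handlers existed.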

Logs from browser

[EV user_start] {event_type:"user_start"}
[FRAME] 512
[FRAME] 512
[EV user_stop] {event_type:"user_stop"}
(…no further events…)

Network › WS › streaming.chat only shows the two JSON lines above.
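
As a sanity check on the frame log, here is a back-of-the-envelope estimate of how much audio two 512-byte frames carry. It assumes each frame is raw 16-bit mono PCM at the 16 kHz rate requested via sttSettings. That assumption is unverified, since the frames may carry a protobuf envelope and the log above may be abbreviated.

```javascript
// Assumptions (not confirmed by the SDK): each frame is raw 16-bit mono
// PCM at 16 kHz; any protobuf envelope around the payload is ignored.
function frameDurationMs(byteLength, sampleRate = 16000, bytesPerSample = 2) {
  const samples = byteLength / bytesPerSample;
  return (samples * 1000) / sampleRate;
}

const frames = [512, 512]; // byte lengths from the [FRAME] log lines
const totalMs = frames.reduce((sum, bytes) => sum + frameDurationMs(bytes), 0);
console.log(totalMs); // 32
```

If only about 32 ms of audio really leaves the browser between user_start and user_stop, that is probably too little for a transcript, so counting frames per utterance in both the sandbox and the full app may be a useful comparison.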

Backend proxy confirms request sequence

/v1/streaming.new               200
/v1/streaming.start             200
/v1/streaming.start_listening   200

No other streaming endpoints are hit after that.

Questions for the HeyGen team / community

  1. Are there circumstances where the server would ignore valid audio frames yet still send user_start/stop?
  2. Is there an additional flag (account-level or per-session) required to enable STT?
  3. Does startListening() need to be re-issued after the WS opens, or is awaiting open followed by a single call sufficient?
  4. Any known incompatibilities with proxying through fetch("/api/heygen/proxy?path=...") (body is unchanged JSON)?
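
For question 4, this is a sketch of how the proxied call could be wrapped so that the path is always URL-encoded and non-2xx responses are surfaced rather than swallowed. `proxyUrl` and `proxyPost` are hypothetical helpers. The use of POST with a JSON response is an assumption about the proxy, and only the "/api/heygen/proxy" route and its `path` parameter come from the question itself.

```javascript
// Hypothetical helper: build the proxy URL with the target path encoded.
// The route and the "path" parameter come from the post; the rest is a sketch.
function proxyUrl(path, base = "/api/heygen/proxy") {
  const query = new URLSearchParams({ path });
  return `${base}?${query.toString()}`;
}

// Forward a JSON body and throw on non-2xx responses.
// Assumes the proxy accepts POST and replies with JSON.
async function proxyPost(path, body, fetchImpl = fetch) {
  const response = await fetchImpl(proxyUrl(path), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`${path} failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

For example, proxyUrl("/v1/streaming.new") yields "/api/heygen/proxy?path=%2Fv1%2Fstreaming.new", which keeps the slash in the path from being misread by the proxy route.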

Full session id of a failing run (2025-05-12 14:11 UTC): dbcc07a5-2ea8-11f0-8041-aafedb6f6c4d

Happy to provide full HAR / WS capture if needed.
Thanks for any insights!

– Mindhelp Chat Team