Transform Your Photo into Real-Time Speech with V2 API

Quick Guide: Use RealTime Avatar in 8 Steps

In this section, we will guide you through the eight steps to establish a WebRTC connection and drive real-time speech using our API. Below is a concise summary of the process. You can also access the demo code in our repository.

1. Getting a Talking Photo ID (Optional)

Before initiating a real-time WebRTC connection, consider uploading your selected image to generate a speaking photo avatar, known as a "Photar". Use the Upload talking photo API to retrieve the talking_photo_id; we will use it later as the photar_id.

curl -X POST https://upload.heygen.com/v1/talking_photo \
-H 'X-Api-Key: <your api key>' \
-H 'Content-Type: image/jpeg' \
--data-binary '@<local file path>'
{
  "code": 100,
  "data": {
    "talking_photo_id": "<talking_photo_id>",
    "talking_photo_url": "<talking_photo_url>"
  }
}

You can also retrieve an existing talking_photo_id by calling the List talking photos API.

curl -X GET https://api.heygen.com/v1/talking_photo.list \
-H 'accept: application/json' \
-H 'x-api-key: <your-api-key>'
{
  "code": 100,
  "message": "Success",
  "data": [
    {
      "id": "<talking_photo_id>",
      "circle_image": "<circle_image>",
      "image_url": "<image_url>"
    }
  ]
}
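The two calls above can be sketched in JavaScript. This is a minimal sketch, assuming Node 18+ with the built-in fetch; the URL and headers mirror the curl examples, and the function name `uploadTalkingPhoto` is ours, not part of the API.

```javascript
const UPLOAD_URL = "https://upload.heygen.com/v1/talking_photo";

// Upload a JPEG image buffer and return the new talking_photo_id.
async function uploadTalkingPhoto(apiKey, imageBuffer) {
  const resp = await fetch(UPLOAD_URL, {
    method: "POST",
    headers: { "X-Api-Key": apiKey, "Content-Type": "image/jpeg" },
    body: imageBuffer,
  });
  const { data } = await resp.json();
  return data.talking_photo_id; // used as photar_id in step 3
}
```

Store the returned ID; you will pass it as photar_id when creating the session.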

2. Getting a Voice ID (Optional)

Select the voice you wish to use for the real-time WebRTC connection by calling the List voices API to retrieve the voice_id. Ensure the selected voice has the support_realtime attribute set to true.

curl -X GET https://api.heygen.com/v1/voice.list \
-H 'accept: application/json' \
-H 'x-api-key: <your-api-key>'
{
  "code": 100,
  "message": "Success",
  "data": {
    "list": [
      {
        "voice_id": "<voice_id>",
        ...
        "support_realtime": false
      },
      {
        "voice_id": "<voice_id>",
        ...
        "support_realtime": true
      }
    ]
  }
}
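Since only voices with support_realtime set to true can be used, it is convenient to filter the voice.list response client-side. A small sketch (the helper name `pickRealtimeVoices` is ours):

```javascript
// Keep only voices that can drive a real-time session.
function pickRealtimeVoices(voiceList) {
  return voiceList.filter((v) => v.support_realtime === true);
}

// Example, using the shape of the voice.list response:
const voices = [
  { voice_id: "a", support_realtime: false },
  { voice_id: "b", support_realtime: true },
];
const usable = pickRealtimeVoices(voices); // only the "b" entry remains
```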

3. Create a new session

See detailed API reference
To establish a real-time WebRTC connection, initiate a new session by calling the New Session API. This step fetches the server's offer SDP (Session Description Protocol) and ICE (Interactive Connectivity Establishment) server.

curl -X POST  https://api.heygen.com/v2/realtime/new \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "quality": "high",
  "avatar": {
    "avatar_type": "photar",
    "photar_id": "<photar_id>"
  },
  "voice": {
    "voice_id": "<voice_id>"
  },
  "dimension": {
    "width": 640,
    "height": 450
  }
}'
{
  "data": {
    "session_id": "<session_id>",
    "sdp": {
      "type": "offer",
      "sdp": "<sdp-data>"
    },
    "ice_servers": [
      {
        "urls": ["<url>", "<url>", ...],
        "username": "<username>",
        "credential": "<credential>"
      },
      ...
    ]
  }
}

In your main control logic, create a new WebRTC connection using this information.

sessionInfo = await newSession("high");  
const { sdp: serverSdp, ice_servers: iceServers } = sessionInfo;  
peerConnection = new RTCPeerConnection({ iceServers: iceServers });
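For reference, the newSession helper used above could be a thin wrapper around the New Session API. This is a sketch under the same assumptions as before (Node 18+ fetch); the request body mirrors the curl example, and API_KEY is a placeholder you would supply.

```javascript
const API_BASE = "https://api.heygen.com/v2/realtime";
const API_KEY = "<your-api-key>"; // placeholder

// Create a new session and return { session_id, sdp, ice_servers }.
async function newSession(quality) {
  const resp = await fetch(`${API_BASE}/new`, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({
      quality,
      avatar: { avatar_type: "photar", photar_id: "<photar_id>" },
      voice: { voice_id: "<voice_id>" },
      dimension: { width: 640, height: 450 },
    }),
  });
  const { data } = await resp.json();
  return data;
}
```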

4. Start the session

See detailed API reference
Once you have obtained the necessary connection information from the server, establish the connection by sending your answer SDP back, which is done with the Start Session API.

curl -X POST https://api.heygen.com/v2/realtime/start \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "session_id": "<session_id>",
  "sdp": {
    "sdp": "<sdp>",
    "type": "answer"
  }
}'
{  
  "data": null
}

In general, follow these steps to complete the SDP exchange:

  1. Create a new WebRTC peer connection object and set the necessary callbacks, such as ontrack, to handle incoming audio and video streams.
  2. Retrieve the SDP offer from the newSession API response and set it as the remote description using setRemoteDescription().
  3. Generate the SDP answer by calling createAnswer() on the peer connection.
  4. Set the generated SDP answer as the local description using setLocalDescription().
  5. Send the answer SDP to the server using the startSession API to establish the connection.

Here's an example in JavaScript:

const remoteDescription = new RTCSessionDescription(serverSdp);  
await peerConnection.setRemoteDescription(remoteDescription);

// Create and set local SDP description  
const localDescription = await peerConnection.createAnswer();  
await peerConnection.setLocalDescription(localDescription);

// Start the session  
await startSession(sessionInfo.session_id, localDescription);
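The startSession helper invoked above could look like this. A sketch only: the endpoint and body follow the curl example, and API_KEY is a placeholder constant.

```javascript
const API_KEY = "<your-api-key>"; // placeholder

// Send the answer SDP to the server to complete the exchange.
async function startSession(sessionId, localDescription) {
  const resp = await fetch("https://api.heygen.com/v2/realtime/start", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({
      session_id: sessionId,
      sdp: { sdp: localDescription.sdp, type: "answer" },
    }),
  });
  return resp.json();
}
```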

5. Submit Network Information

After the SDP exchange, WebRTC also requires an exchange of network information to establish the connection. In the peerConnection.onicecandidate callback, collect each ICE candidate and submit it to the server with the Submit ICE information API.

curl -X POST https://api.heygen.com/v2/realtime/ice \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "session_id": "<session_id>",
  "candidate": "<candidate>"
}'
{  
  "data": null
}

Here's an example in JavaScript:

peerConnection.onicecandidate = ({ candidate }) => {
    if (candidate) {
      handleICE(sessionInfo.session_id, candidate.toJSON());
    }
};

You can monitor the peer connection's status through the oniceconnectionstatechange event listener and, when the connection drops, re-run the new, start, and ice steps to obtain a fresh connection.

peerConnection.oniceconnectionstatechange = (event) => {
  if (peerConnection.iceConnectionState === "disconnected") {
    createNewSession(); // re-execute new, start, ice
  }
};
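The handleICE helper referenced above could wrap the Submit ICE information API like so. A sketch under the same assumptions (Node 18+ fetch, API_KEY placeholder); the body forwards the candidate object produced by candidate.toJSON().

```javascript
const API_KEY = "<your-api-key>"; // placeholder

// Forward one local ICE candidate to the server.
async function handleICE(sessionId, candidate) {
  await fetch("https://api.heygen.com/v2/realtime/ice", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({ session_id: sessionId, candidate }),
  });
}
```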

6. Drive the Avatar to Speak

Once the connection is established, you can drive the real-time speech of the avatar by calling the Talk text API. The avatar will articulate the provided text content.

curl -X POST https://api.heygen.com/v2/realtime/task \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "session_id": "<session_id>",
  "text": "<text>"
}'
{  
  "data": null
}
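In JavaScript, driving speech is a single POST per utterance. A sketch; the helper name `sendTask` is ours, and API_KEY is a placeholder.

```javascript
const API_KEY = "<your-api-key>"; // placeholder

// Ask the avatar to speak the given text in the active session.
async function sendTask(sessionId, text) {
  const resp = await fetch("https://api.heygen.com/v2/realtime/task", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({ session_id: sessionId, text }),
  });
  return resp.json();
}

// Usage: sendTask(sessionInfo.session_id, "Hello there!");
```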

7. Close the Session

To close the session and its connection, call the Close session API. A session that receives no task for 5 minutes is terminated automatically.

curl -X POST https://api.heygen.com/v2/realtime/stop \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "session_id": "<session_id>"
}'
{  
  "data": null
}
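A matching JavaScript sketch (helper name `stopSession` is ours, API_KEY is a placeholder):

```javascript
const API_KEY = "<your-api-key>"; // placeholder

// Terminate the session on the server.
async function stopSession(sessionId) {
  await fetch("https://api.heygen.com/v2/realtime/stop", {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({ session_id: sessionId }),
  });
}
```

After the call returns, also close the RTCPeerConnection on the client (peerConnection.close()) to release local WebRTC resources.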

8. Loop Silent Video (Optional)

For an improved user experience, consider generating a silent video of approximately 3 seconds using the selected "Photar" image and looping it. This prevents page lag when the task video is not playing. Use the Create an Avatar Video V2 API for this purpose.

curl -X POST https://api.heygen.com/v2/video/generate \
-H 'Content-Type: application/json' \
-H 'x-api-key: <your-api-key>' \
-d '{
  "video_inputs": [
    {
      "character": {
        "type": "talking_photo",
        "talking_photo_id": "<talking_photo_id>"
      }, 
      "voice":{
       "type":"audio",
       "audio_url": "https://resource.heygen.com/silent.mp3"
      }
    }
  ],
  "dimension": {
    "width": <width>, 
    "height": <height>
  }
}'
{  
  "error": "",  
  "data": {
    "video_id": "<video_id>"
  }
}

To obtain the silent video's public URL, use the Video Status API.

curl -X GET https://api.heygen.com/v1/video_status.get?video_id=<video_id> \
-H 'x-api-key: <your-api-key>'
{
  "code": 100,
  "message": "Success",
  "data": {
    "id": "<video_id>",
    ...
    "video_url": "<video_url>",
    ...
  }
}
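Since video generation is asynchronous, you will typically poll the Video Status API until the video is ready. A sketch: the helper name `waitForVideoUrl` is ours, and the "completed"/"failed" status values are assumptions about the status field, so check them against the Video Status API reference.

```javascript
// Poll video_status.get until the silent video is ready, then return its URL.
async function waitForVideoUrl(apiKey, videoId, intervalMs = 3000) {
  for (;;) {
    const resp = await fetch(
      `https://api.heygen.com/v1/video_status.get?video_id=${videoId}`,
      { headers: { "x-api-key": apiKey } }
    );
    const { data } = await resp.json();
    if (data.status === "completed") return data.video_url; // assumed status value
    if (data.status === "failed") throw new Error("video generation failed");
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```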

Finally, loop the silent video when no task video is returned.

Here's a JavaScript example:

let lastBytesReceived;  
let lastVideoState;

const mediaElement = document.querySelector("#mediaElement");  
mediaElement.setAttribute('playsinline', '');

function playTaskVideo(stream) {  
  if (!stream) return;  
  mediaElement.srcObject = stream;  
  mediaElement.loop = false;  
}

function playSilenceVideo() {  
  mediaElement.srcObject = undefined;  
  mediaElement.src = "<SilenceVideo.mp4>";  
  mediaElement.loop = true;  
}

let statsIntervalId;

function onTrack(event) {
  if (!event.track) return;
  const stream = event.streams[0];

  statsIntervalId = setInterval(async () => {
    const stats = await peerConnection.getStats(event.track);
    stats.forEach((report) => {
      if (report.type === 'inbound-rtp' && report.mediaType === 'video') {
        const isVideoPlaying = report.bytesReceived > lastBytesReceived;
        lastBytesReceived = report.bytesReceived;
        if (lastVideoState !== isVideoPlaying) {
          lastVideoState = isVideoPlaying;
          if (isVideoPlaying) {
            playTaskVideo(stream);
          } else {
            playSilenceVideo();
          }
        }
      }
    });
  }, 500);
}

If you have innovative ideas and require technical support, feel free to reach out to us. We are here to assist you. Additionally, the possibilities for exploration are limitless!

Conclusion

This guide simplifies the process of real-time speech transformation with HeyGen's V2 API. Explore the capabilities and reach out to our support team for assistance.

Enjoy coding and creating immersive experiences!