Streaming Avatar: integrating a different voice provider

I want to keep the current lip-syncing functionality while integrating a third-party voice provider (such as ElevenLabs) and LLM to improve response times. Is this feasible? If so, are there any existing sample implementations?

Additionally, I've noticed a delay of 2 to 4 seconds between receiving the LLM response and the avatar starting to talk, and I'd like to minimize this latency. Would streaming the LLM response, or switching to a faster voice provider and LLM, help?
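
To make the streaming idea concrete, here is roughly what I have in mind: a minimal Python sketch that buffers streamed LLM text and emits each sentence as soon as it is complete, so the first sentence could go to the voice provider while the rest is still being generated. `sentences_from_stream` is a hypothetical helper of my own, not part of any HeyGen or ElevenLabs SDK, and the actual TTS and avatar calls are left out.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from an iterable of streamed text chunks.

    Hypothetical helper: `chunks` stands in for whatever the LLM client
    yields while streaming (assumed here to be plain strings).
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be incomplete, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    fake_stream = ["Hel", "lo there. How", " are you? I am", " fine"]
    for sentence in sentences_from_stream(fake_stream):
        # In a real app, this is where the sentence would be sent to TTS.
        print(sentence)
```

The assumption is that sending sentence-sized pieces to the voice provider lets speech start after the first sentence rather than after the full response. Whether the avatar's lip-sync pipeline accepts input this way is exactly what I'm unsure about.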

Alternatively, might deploying the application to the cloud improve performance? For reference, I'm comparing my results to the streaming avatar demo on HeyGen's website, which appears to have faster response times.