
How to Reduce Initial Latency (8-9s) in Assistants API with File Retrieval while Maintaining Full Retrieval Functionality

We're building a virtual university advisor where maintaining accurate, knowledge-based responses is crucial. Our main goal is to keep the full retrieval functionality but significantly reduce the initial response latency.

Current Setup:

  1. OpenAI Assistant with:
    • Vector store (138 KB)
    • 4 JSON files with university program information
    • File retrieval for accurate Q&A responses
  2. HeyGen Streaming Avatar for response delivery
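For reference, the assistant side of this setup is wired roughly as follows. This is a minimal sketch assuming the official `openai` Python SDK's beta Assistants surface; the names (`build_assistant_config`, `create_advisor`, the model and instructions strings) are illustrative, not our production code:

```python
def build_assistant_config(vector_store_id: str) -> dict:
    """Assemble the assistant payload: a file_search tool bound to one vector store."""
    return {
        "name": "University Advisor",
        "model": "gpt-4o",
        "instructions": "Answer questions about university programs "
                        "using the attached files.",
        "tools": [{"type": "file_search"}],
        "tool_resources": {
            "file_search": {"vector_store_ids": [vector_store_id]},
        },
    }


def create_advisor(json_paths: list[str]):
    """Upload the 4 JSON program files (~138 KB total) into one vector store,
    then create the assistant that retrieves from it."""
    from openai import OpenAI  # deferred import: the helper above needs no SDK

    client = OpenAI()
    vector_store = client.beta.vector_stores.create(name="program-info")
    with open(json_paths[0], "rb") as first, \
         open(json_paths[1], "rb") as second:
        # file_batches.upload_and_poll uploads and waits for indexing in one call
        client.beta.vector_stores.file_batches.upload_and_poll(
            vector_store_id=vector_store.id,
            files=[first, second],  # extend with the remaining files as needed
        )
    return client.beta.assistants.create(**build_assistant_config(vector_store.id))
```

Keeping the payload construction in a pure helper makes it easy to inspect exactly which tools and vector stores the assistant is bound to when debugging retrieval behavior.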

Current Behavior:

[0ms] Starting response generation
[1220ms] Stream started
[8904ms] First text chunk received  <- ~7.7s delay after stream start
[9424ms] First complete sentence
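The timestamps above were captured client-side around the streamed response. A generic sketch of that measurement, where `fake_stream` is a stand-in for the real Assistants API text-delta stream (the helper names are illustrative):

```python
import time


def time_first_chunk(stream):
    """Consume a stream of text chunks, recording millisecond offsets for the
    first chunk and the first complete sentence (ending in . ! or ?)."""
    t0 = time.monotonic()
    marks = {}
    buf = ""
    for chunk in stream:
        now_ms = int((time.monotonic() - t0) * 1000)
        marks.setdefault("first_chunk_ms", now_ms)  # only set on the first chunk
        buf += chunk
        if "first_sentence_ms" not in marks and any(p in buf for p in ".!?"):
            marks["first_sentence_ms"] = now_ms
    return marks


def fake_stream():
    # Stand-in for the real event stream; simulates a short delay before text.
    time.sleep(0.05)
    yield "Hello"
    yield " world."


print(time_first_chunk(fake_stream()))
```

Wrapping the stream this way keeps the instrumentation independent of the SDK, so the same helper can time the Assistants API today and Chat Completions later if we switch.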

Critical Requirements:

  • MUST maintain full retrieval capabilities
  • MUST keep response accuracy from knowledge base
  • MUST continue using file-based knowledge system
  • Only want to optimize latency without compromising these features

Questions:

  1. How can we reduce this initial 7-8 second latency while keeping ALL retrieval functionality intact?
  2. Are there optimization techniques that don't compromise the retrieval quality?
  3. Could we optimize the vector store/file structure while maintaining the same knowledge coverage?

We specifically want to avoid solutions that suggest:

  • Removing/reducing retrieval capabilities
  • Simplifying the knowledge base
  • Using simpler but less accurate responses

The goal is purely performance optimization while keeping the current functionality exactly as is. I don't know whether I should switch away from the Assistants API, but there isn't much time left on this project and I'm stuck.