Discussions
How to Reduce Initial Latency (8-9s) in Assistants API with File Retrieval while Maintaining Full Retrieval Functionality
3 months ago by Michael Hajster
We're building a virtual university advisor where maintaining accurate, knowledge-based responses is crucial. Our main goal is to keep the full retrieval functionality but significantly reduce the initial response latency.
Current Setup:
- OpenAI Assistant with:
  - Vector store (138 KB)
  - 4 JSON files with university program information
  - File retrieval for accurate Q&A responses
- HeyGen Streaming Avatar for response delivery
Current Behavior:
```
[0ms]    Starting response generation
[1220ms] Stream started
[8904ms] First text chunk received  <- ~7.7s delay after stream start
[9424ms] First complete sentence
```
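To make it obvious where the time goes, the timestamps above can be turned into per-stage deltas. This is a hypothetical helper (not part of any SDK), fed with the exact values from the trace:

```python
# Compute the gap between consecutive events from millisecond
# timestamps logged relative to the start of response generation.
def latency_breakdown(events: dict) -> dict:
    """Return the delta (ms) between each consecutive event."""
    ordered = sorted(events.items(), key=lambda kv: kv[1])
    return {
        f"{prev[0]} -> {cur[0]}": cur[1] - prev[1]
        for prev, cur in zip(ordered, ordered[1:])
    }

# Values taken directly from the trace above.
trace = {
    "request_start": 0,
    "stream_started": 1220,
    "first_chunk": 8904,
    "first_sentence": 9424,
}

print(latency_breakdown(trace))
# The dominant gap is stream_started -> first_chunk (7684 ms),
# which points at retrieval/tool time rather than connection setup.
```

Breaking the total down this way shows that network and stream setup (~1.2s) are not the problem; almost all of the wait sits between the stream opening and the first token.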
Critical Requirements:
- MUST maintain full retrieval capabilities
- MUST keep response accuracy from knowledge base
- MUST continue using file-based knowledge system
- Only want to optimize latency without compromising these features
Questions:
- How can we reduce this initial 7-8 second latency while keeping ALL retrieval functionality intact?
- Are there optimization techniques that don't compromise the retrieval quality?
- Could we optimize the vector store/file structure while maintaining the same knowledge coverage?
We specifically want to avoid solutions that suggest:
- Removing/reducing retrieval capabilities
- Simplifying the knowledge base
- Using simpler but less accurate responses
The goal is purely performance optimization while keeping the current functionality exactly as is. I don't know whether I should switch away from the Assistants API, but this project doesn't have much time left and I'm lost.
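One optimization that doesn't touch retrieval at all: the gap between the first text chunk and the first complete sentence can be hidden by buffering streamed text deltas and handing each sentence to the avatar the moment it closes, instead of waiting for the full message. A minimal, framework-free sketch of that buffering (the delta strings below are illustrative; in practice they would come from the run's streaming text-delta events, and each yielded sentence would be sent to the HeyGen avatar):

```python
import re

# Naive sentence-boundary heuristic: a terminator followed by whitespace.
# Real text (abbreviations, decimals) may need smarter splitting.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(deltas):
    """Yield complete sentences as soon as they appear in the stream."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains at end of stream

# Illustrative deltas standing in for streamed chunks.
deltas = ["The CS program ", "requires 120 credits. ",
          "Applications close ", "in May."]
for sentence in stream_sentences(deltas):
    print(sentence)
```

This keeps retrieval and the knowledge base exactly as they are; it only moves the avatar's start of speech to the first sentence boundary rather than the end of the run.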