Latency and speech recognition


When conversational input is sent to a Digital Person (i.e. a user says something), that information runs through a number of processes that cumulatively add up to the delay you see before the Digital Person responds. Those processes are:
  • Video streamed to SM servers
  • Audio transcribed by speech provider
  • Transcription processed by Skills API to generate a response
  • Response audio generated by speech provider
  • Video streamed to user
These processes are broad strokes and happen in parallel as much as possible; the sketch below gives a rough sense of how a turn's latency budget adds up.
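As a rough mental model, you can think of each conversational turn as a latency budget. The sketch below (TypeScript, using placeholder numbers rather than measured Soul Machines figures) simply adds up illustrative per-stage timings:

```ts
// Illustrative latency budget for one conversational turn.
// All millisecond figures are placeholder assumptions, not measured values.
type Stage = { name: string; ms: number };

const stages: Stage[] = [
  { name: "Video/audio streamed to SM servers", ms: 80 },
  { name: "Speech transcription finalised (STT)", ms: 500 },
  { name: "Skills API generates a response", ms: 700 },
  { name: "Response audio generated (TTS, first chunk)", ms: 200 },
  { name: "Video streamed back to the user", ms: 80 },
];

for (const s of stages) {
  console.log(`${s.name}: ~${s.ms} ms`);
}

const total = stages.reduce((sum, s) => sum + s.ms, 0);
console.log(`Serial worst case: ~${total} ms (real turns overlap these stages)`);
```

Because the stages overlap, the real delay is lower than a straight sum, but it shows why the STT finalisation and response generation steps discussed below tend to dominate.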

We use latency-based routing to send your user to the server which will have the lowest latency for them. The quality of their network connection and relative distance to our servers (by internet topology) will impact the video streaming to and from our servers. Ping and packet loss are more impactful than raw bandwidth. Once video has arrived at our servers, latency from connecting to multiple services is typically very low, as we connect to the closest data centers possible.
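If you want a rough sanity check of a user's network path, a simple HTTPS round-trip probe is a starting point. This is a hedged sketch: the URL is a placeholder rather than an official Soul Machines endpoint, and it only approximates round-trip time - it won't surface packet loss.

```ts
// Rough HTTPS round-trip probe. The URL below is a placeholder, not an SM endpoint.
import https from "node:https";

function probe(url: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    https
      .get(url, (res) => {
        res.resume(); // discard the body; we only care about the timing
        res.on("end", () => resolve(Date.now() - start));
      })
      .on("error", reject);
  });
}

probe("https://example.com/").then((ms) => console.log(`HTTPS round trip: ~${ms} ms`));
```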

Your choice of speech providers and models will have an impact in two different ways:
  • STT: We stream audio for real-time transcription, but each provider has its own algorithm for 'finalising' transcriptions. Finalisation tells us that the user has finished speaking and that we can send the transcript as input to the Skills API (there's a sketch of this gating after the list). Depending on the circumstances I've seen speech finalise immediately, in around 0.5s, and sometimes be held open for several seconds.
  • TTS: We also stream this audio asynchronously. Each voice seems to have different performance characteristics. Anecdotally a 'standard' voice responds faster than a 'neural' voice, which I think reflects the quality and complexity of the generation.
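To make the STT point concrete, here's a minimal sketch of how finalisation gates what reaches the Skills API. The TranscriptEvent shape and callbacks are generic assumptions for illustration, not any provider's actual SDK types:

```ts
// Hedged sketch: only finalised transcripts are forwarded for response generation.
// The event shape below is an assumption, not a specific STT provider's API.
interface TranscriptEvent {
  text: string;
  isFinal: boolean; // the provider decides when to flip this, which drives latency
}

function onTranscript(event: TranscriptEvent, sendToSkill: (text: string) => void) {
  if (!event.isFinal) {
    // Interim results can update a caption or UI, but aren't sent downstream.
    return;
  }
  // The finalised transcript is what triggers response generation.
  sendToSkill(event.text);
}

// Usage with a stubbed downstream call.
onTranscript({ text: "hello there", isFinal: false }, (t) => console.log("to Skill:", t));
onTranscript({ text: "hello there, how are you", isFinal: true }, (t) => console.log("to Skill:", t));
```

However long the provider holds the transcript open before flagging it final is dead time added before your Skill even sees the input.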

Generating a response to user input with the Skills API is where we often see a significant chunk of latency added, and it's where you have the most control. Naturally you'd expect a response to be slower if you're performing an action or calling a large LLM; similarly, different conversation platforms will have different performance characteristics when processing text. The chain of processes here is:
  • We pass finalised input to the Skills platform.
  • The input is routed through any preprocess Skills configured.
  • The input is routed to (typically) your main conversation Skill.
  • We wait for result(s) from the Skill and repeat routing if there's a fallback.
  • The response is routed through any postprocess Skills added.
The obvious takeaway from this is that the more Skills are chained together, the more processing and latency will be added (there's a timing sketch after this list). The parts which aren't so obvious are:
  • When a Skill integrates a third-party service like Voiceflow, there's a hop to the Skill (a lambda) which then routes to Voiceflow and back again.
  • Until recently Skills were synchronous, using webhooks; they can now be asynchronous, using websockets. There's a small performance penalty each time a connection is established with a webhook, whereas websockets maintain that connection. Not all of our Skills have been updated to use websockets yet.
  • Strictly speaking an orchestration server can be milliseconds faster than an equivalent Skill because it skips any routing logic.
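To see how each hop adds up, here's a small timing sketch of a preprocess → conversation → postprocess chain. The Skill signature and the stubbed delays are illustrative assumptions, not the actual Skills API contract:

```ts
// Hedged sketch: time each hop in a chained-Skill pipeline.
// The Skill type and stub delays are illustrative, not the Skills API contract.
type Skill = (input: string) => Promise<string>;

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runChain(input: string, skills: { name: string; fn: Skill }[]): Promise<string> {
  let text = input;
  for (const { name, fn } of skills) {
    const start = Date.now();
    text = await fn(text);
    console.log(`${name}: ${Date.now() - start} ms`); // every extra hop adds a line here
  }
  return text;
}

runChain("hi", [
  { name: "preprocess", fn: async (t) => { await delay(50); return t; } },
  { name: "conversation", fn: async (t) => { await delay(400); return `reply to: ${t}`; } },
  { name: "postprocess", fn: async (t) => { await delay(50); return t; } },
]).then((out) => console.log(out));
```

Each named stage prints its own elapsed time, so an extra Skill, or an extra hop out to a third-party service like Voiceflow, shows up directly as another line in the log.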

If you want to improve the latency of your Digital Person, start by benchmarking it against the Digital People on soulmachines.com - these vary, but generally use Deepgram STT, Microsoft TTS, the Generative conversation Skill with gpt-4o-mini, and two postprocess Skills.
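A simple way to benchmark is to timestamp the moment the user stops speaking and the moment the Digital Person starts its reply, in the same browser session for both your project and the reference one. The helper below is a hedged sketch - the event labels in the comments are placeholders you'd hook up yourself, not Web SDK event names:

```ts
// Minimal stopwatch for manual benchmarking. The labels are arbitrary strings;
// wiring them to real UI/SDK events is up to your integration.
const marks = new Map<string, number>();

function mark(label: string): void {
  marks.set(label, performance.now());
}

function measure(from: string, to: string): void {
  const a = marks.get(from);
  const b = marks.get(to);
  if (a !== undefined && b !== undefined) {
    console.log(`${from} -> ${to}: ${(b - a).toFixed(0)} ms`);
  }
}

// e.g. call mark("userStoppedSpeaking") when the mic goes quiet,
// mark("dpStartedSpeaking") on the first response audio, then:
// measure("userStoppedSpeaking", "dpStartedSpeaking");
```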

Some tips for speeding up the response time are:
  • Use an asynchronous Skill so that the first sentence of the response can be sent before the rest has finished processing.
  • Include a short sentence at the start of the response so that TTS can finish processing that first sentence and stream it sooner (the sketch after these tips combines this with the previous point).
  • Do as little processing as possible when generating a response.
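Here's a minimal sketch of the first two tips combined: emit a short acknowledgement immediately, then send the full answer once the slower backend call returns. The respond, emit and generateFullAnswer names are illustrative stand-ins, not the asynchronous Skills interface:

```ts
// Hedged sketch: a short first sentence streams immediately, the full answer follows.
// respond, emit and generateFullAnswer are illustrative names, not SDK functions.
async function respond(
  userInput: string,
  generateFullAnswer: (input: string) => Promise<string>,
  emit: (utterance: string) => void,
): Promise<void> {
  emit("Sure, let me check that."); // short sentence TTS can synthesise right away
  const answer = await generateFullAnswer(userInput); // slower LLM / backend call
  emit(answer);
}

// Usage with a stubbed 1.2 second backend.
respond(
  "What's your return policy?",
  (q) => new Promise<string>((resolve) => setTimeout(() => resolve(`Here's the detail on: ${q}`), 1200)),
  (utterance) => console.log(`${new Date().toISOString()} ${utterance}`),
);
```

The user hears the acknowledgement while the longer answer is still being generated, which is exactly what an asynchronous Skill makes possible.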

As for whether you should build an orchestration server to reduce latency - the key difference between an orchestration server and an asynchronous Skill is that the orchestration server receives a lot of extra information about the scene and the Digital Person. It's been described as a fire hose of information. For that benefit you have to implement asynchronous message queueing and so on yourself. I'd normally advise against this unless you have a specific use for that data, or you have a very specific use case that doesn't fit the Skills framework.

My read on the Voiceflow Skill is that it's still using the synchronous API, so if you wrote your own integration as either an asynchronous Skill or an orchestration server you might be able to shave a tiny bit of time off each response. I recommend against doing this unless you can pinpoint that the Voiceflow Skill is your major bottleneck, rather than Voiceflow itself or any of the other Soul Machines project settings. Remember that we will not provide support for custom development.

Just on speech recognition accuracy: double-check there's nothing seriously wrong with your microphone quality or settings, then, using the target equipment in the target environment, test the different STT providers with the ideal language set. Anecdotally, Deepgram does the best job of transcription right now, and US English tends to capture a wider range of English accents accurately than other English locales.
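When comparing providers, it helps to score their transcripts against a known reference rather than eyeballing them. The sketch below computes a simple word error rate (WER); the reference sentence and provider outputs are made-up examples:

```ts
// Word error rate: Levenshtein distance over words, divided by reference length.
// The sample transcripts below are invented for illustration.
function wer(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return ref.length === 0 ? 0 : d[ref.length][hyp.length] / ref.length;
}

const reference = "please book me a table for two at seven";
console.log("Provider A WER:", wer(reference, "please book me a table for two at seven"));
console.log("Provider B WER:", wer(reference, "please book me a cable for 2 at 7"));
```

Record a handful of phrases with the target microphone and accents, run them through each provider, and compare the scores rather than a single impression.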
