Stormin' The Castle

voice, speech-to-text

Still Talking?

by John Robinson @johnrobinsn

All Seeing Robot

Introduction

One of the challenges in developing a voice pipeline for a real-time voice-enabled AI agent is determining when the user has finished speaking.

I've been coding such a pipeline while developing a framework that allows bidirectional communication of video, audio, text and other data to my own AI agents. Leveraging WebRTC allows me to communicate with my AI agent/models from anywhere even if they are running on my own hardware at home. I only need to run a very light-weight signaling server in the cloud.

In the rest of this article, I'll give a brief overview of real-time voice pipelines and drill down a bit more on the problem of detecting when the user has finished speaking.

Overview of a Voice Pipeline

A typical speech-to-text pipeline performs several key steps before the audio is fed into the ASR subsystem, such as voice activity detection (VAD) to filter out non-speech audio and buffering of the detected speech segments.

These steps help optimize processing, since ASR systems are computationally expensive. Additionally, some ASR models, like Whisper, tend to hallucinate when presented with only background noise, so the goal is to present the ASR model exclusively with audio packets that contain actual speech.
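To make this concrete, here's a minimal sketch of speech-only gating using the open-source webrtcvad package; the sample rate, frame size, and aggressiveness setting are illustrative values rather than the configuration from my pipeline.

```python
# pip install webrtcvad
import webrtcvad

SAMPLE_RATE = 16000   # webrtcvad expects 16-bit mono PCM at 8, 16, 32, or 48 kHz
FRAME_MS = 30         # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher filters out more non-speech

def speech_frames(pcm: bytes):
    """Yield only the frames the VAD classifies as containing speech."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[offset:offset + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```

Only the frames that survive this gate get accumulated and handed to the ASR model, which keeps Whisper and friends from transcribing silence.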

End of Utterance (EOU) Detection

While the steps outlined above help us determine the beginning of an utterance, how do we determine when the user has completed their thought or request? A common heuristic approach accumulates speech-laden audio buffers until a sufficient period of silence is detected and then uses that silence as the end-of-utterance marker. However, this approach is fragile, since people naturally pause while speaking, sometimes using filler words such as "um," "uh," etc.
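To see why this is fragile, here's a rough sketch of the silence-timeout heuristic; the class name and the 0.7-second threshold are illustrative rather than taken from my pipeline. Any pause longer than the threshold ends the utterance, even if the speaker is just gathering their thoughts.

```python
import time
from typing import Optional

SILENCE_TIMEOUT_S = 0.7  # illustrative threshold; real speakers often pause longer

class SilenceEOUDetector:
    """Naive end-of-utterance detector: EOU fires after a fixed gap of silence."""

    def __init__(self, timeout_s: float = SILENCE_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.buffered_frames = []     # speech frames for the current utterance
        self.last_speech_time = None  # monotonic timestamp of the last speech frame

    def push_frame(self, frame: bytes, is_speech: bool) -> Optional[bytes]:
        """Feed one audio frame; returns the buffered utterance once EOU is detected."""
        now = time.monotonic()
        if is_speech:
            self.buffered_frames.append(frame)
            self.last_speech_time = now
            return None
        # Silence: declare end of utterance once the gap since the last speech
        # frame exceeds the timeout -- a thinking pause looks exactly the same.
        if self.buffered_frames and now - self.last_speech_time >= self.timeout_s:
            utterance = b"".join(self.buffered_frames)
            self.buffered_frames = []
            self.last_speech_time = None
            return utterance
        return None
```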

A more robust approach is to use a small AI model that analyzes the inbound speech patterns to avoid triggering an unintended end-of-utterance (EOU) event.

Experimenting with LiveKit's EOU Model

During my research, I discovered LiveKit's turn-detector model, available on Hugging Face. However, its documentation is sparse, so I worked out how to invoke the model directly with minimal code in order to experiment with it.

This model is not a traditional classifier that simply labels an utterance as complete or incomplete. Instead, it is a (smallish) large language model (LLM) that processes text tokens and, via next-token prediction, estimates the likelihood that the utterance is complete. If the EOU model predicts that the utterance is probably not complete at a given point in time, we can extend the amount of time we wait for the user to finish their voice input, thereby improving the user experience.
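To illustrate the idea, here's a minimal sketch of next-token EOU scoring; note that this is a generic example, not LiveKit's actual API. The model id below is just a placeholder small LLM, and the end-of-sequence token stands in for whatever end-of-turn marker the real model predicts; the notebook linked below shows how to load the actual LiveKit checkpoint.

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceTB/SmolLM-135M"  # placeholder small causal LM, not LiveKit's checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def eou_probability(transcript: str) -> float:
    """Probability mass the model assigns to 'the utterance ends here'."""
    inputs = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    # EOS is a stand-in for the end-of-turn token the real EOU model is trained on.
    return next_token_probs[tokenizer.eos_token_id].item()

# Extend the silence timeout when the model thinks the user is mid-thought.
BASE_TIMEOUT_S, EXTENDED_TIMEOUT_S = 0.7, 2.0

def silence_timeout(transcript: str, threshold: float = 0.15) -> float:
    return BASE_TIMEOUT_S if eou_probability(transcript) >= threshold else EXTENDED_TIMEOUT_S

print(silence_timeout("so what I was thinking was"))       # likely extended
print(silence_timeout("what's the weather like today?"))   # likely the base timeout
```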

Here's a link to my Colab notebook, which provides a minimal example of how you can load up and experiment with the LiveKit EOU model.

Link to Colab notebook.

Give this notebook a try to further understand LiveKit's approach to the problem and experiment with it yourself.

Summary and Conclusion

Correctly detecting the end of an utterance is a crucial aspect of real-time voice pipelines. It can make the difference between a user experience that is a joy to use and one that is a source of frustration. While heuristic approaches exist, AI-based models, such as LiveKit's turn-detector, provide a more reliable way to determine EOU.

LiveKit's LLM-based approach has a number of advantages: it is easy to train and easy to integrate into an existing speech-to-text pipeline. Even more robust strategies would be to train a model on the audio itself, so that it can take changes in intonation into account, and to integrate timing information from the input audio stream.

Let me know if you have any questions or suggestions!

Please like my content on Twitter, @johnrobinsn



John Robinson © 2022-2025