The current smart endpointing model is not great and sometimes delays the response by almost 1.5 seconds when quick words like "Yes" or "Yeah" are spoken.
I think a more advanced endpointing and turn detection model is a big request from the community right now and would really improve call quality.
LiveKit has implemented a perfect open source plugin that does this for its text-to-speech models:
I hope we get something similar or an improved version of the Smart Endpointing model we have :)