
Think about the last time you used Instagram, a chatbot, or an ATM. You learned the interface. You figured out where to tap, what to type, and how to phrase your query. Machines set the terms, and humans adapted. That bargain has held for decades, and it is the reason your parents eventually figured out how to share reels.
Voice breaks that bargain entirely.
“We have been speaking to each other for thousands of years,” said Ayush Anand, Head of Product Voice at Plivo, at DevSparks Bengaluru 2026. “In these two interfaces, the AI has to adapt to human beings, the way we speak in our languages, we mix codes, we pause, we reflect, we think.”
That inversion, simple as it sounds, is what makes voice AI one of the most technically demanding frontiers in the field right now. Ayush’s lightning talk at YourStory’s flagship developer summit laid out exactly why and what it takes to build infrastructure capable of meeting that challenge.
The tolerance gap
Ayush opened with a comparison that landed immediately. If a chatbot takes five seconds to respond to “where is my order”, you would barely notice. Put the same delay on a phone call, and you are already frustrated at the two-second mark.
The stakes of a single bad turn are also incomparably higher on voice. A chat interface shows you its mistakes on screen, giving you a chance to correct course. On a call, you may not even realize the agent has misheard you. “It’s going on its own tangent,” Ayush said, “and you don’t even know.”
He used a pointed example: the word Mumbai. Common, well represented in training data, and still regularly misheard by voice models as something else entirely. “Think about all the interesting names and places that India has,” he said.
India’s compounding complexity
If voice AI is tough everywhere, India makes it structurally tougher. Ayush pointed to 22 official languages and said roughly 60% of calls are code-mixed, with callers switching between English and Hindi, Tamil, or Bengali, or sometimes several of them in a single sentence.
Most global models are trained predominantly on English. Hindi has some representation, but for languages like Odia or those spoken in India’s northeast, “the data hardly exists”, Ayush said. That absence is not a gap that prompt engineering can bridge.
Seven models, 750 milliseconds
Beyond language, there is the architecture problem. What looks like a single voice interaction is actually a cascade of seven separate models, each running in real time, often on different servers across different geographies.
There is noise isolation, stripping out a ringing phone or a crying child to isolate the primary voice. There is turn detection, the model’s ability to recognize that you have finished speaking. There is speech-to-text, a language model, text-to-speech, and more. “All these six, seven layers have to be processed in real time,” Ayush said, “and all of this has to happen within 750 milliseconds.”
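To make the 750-millisecond figure concrete, here is a minimal Python sketch of a latency budget check. The per-stage numbers are invented purely for illustration (the talk gave no breakdown), only the five stages named above are listed, and the stages are assumed to run one after another with no overlap.

```python
# Total response budget quoted in the talk.
BUDGET_MS = 750

# Hypothetical per-stage latencies in milliseconds; illustrative only.
stage_latency_ms = {
    "noise_isolation": 30,
    "turn_detection": 80,
    "speech_to_text": 150,
    "language_model": 300,
    "text_to_speech": 120,
}


def remaining_budget(latencies: dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return the headroom left after summing sequential stage latencies."""
    return budget_ms - sum(latencies.values())


headroom = remaining_budget(stage_latency_ms)
print(f"used {BUDGET_MS - headroom} ms, headroom {headroom} ms")
```

Under these assumed numbers the pipeline uses 680 ms and leaves 70 ms, which any extra layers and the network hops between servers would still have to fit into.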
The compounding effect is sobering. “If all of these models were 99% accurate, on an average, the whole pipeline is just 93% accurate,” he said. For comparison, a chat agent runs through roughly one such model. Voice runs through seven.
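The arithmetic behind that figure can be sketched in a few lines of Python. This assumes each stage fails independently of the others and that any single error spoils the turn, which is a simplification rather than something the talk spelled out; the function name is hypothetical.

```python
def pipeline_accuracy(stage_accuracy: float, stages: int) -> float:
    """Probability that every stage is correct, assuming independent errors."""
    return stage_accuracy ** stages


voice = pipeline_accuracy(0.99, 7)  # seven-model voice cascade
chat = pipeline_accuracy(0.99, 1)   # single-model chat agent
print(f"voice: {voice:.1%}, chat: {chat:.1%}")  # voice: 93.2%, chat: 99.0%
```

Seven stages at 99% each multiply out to roughly 93%, matching the number Ayush quoted.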
Speech-to-speech models, which could collapse the entire cascade into a single system and cut latency significantly, are on the horizon. “But it will still take time,” Ayush said. “The models are not ready yet.”
Where Plivo fits
Plivo was voice infrastructure long before AI entered the picture, Ayush explained, powering contact centers and the call-routing systems behind quick commerce apps. It now brings that same developer platform to voice AI, letting teams connect their choice of STT, TTS, and LLM providers through a single layer, or build their own models in-house if sufficient data exists. For teams that want to move faster, managed agents allow builders to stand up a voice agent from a prompt alone, prototyping and iterating without full development cycles. The platform is built with developers as the primary user, but also accommodates non-technical teams who need to build and manage agents without writing code, Ayush noted.
The session closed with a live demo of a clinic appointment reminder agent, a deliberately simple use case that Ayush was careful not to present as representative. Real calls involve frustration, anxiety, and the kind of unscripted human behavior that stress-tests every layer of the stack. Teams curious to see what production-grade voice AI looks like in practice were pointed to Plivo’s booth outside the main hall, where live demonstrations of solved use cases were on offer.

