Why latency makes or breaks conversation translation
Ask people what matters in a translation app and they say accuracy. Watch people abandon a translation app mid-conversation and the reason is almost always speed. A translation that arrives four seconds late is accurate the way a joke explained the next morning is funny.
How fast do people normally respond?
Conversation has a tempo, and it is startlingly fast. A widely cited cross-cultural study of turn-taking measured the gap between one person finishing and the other starting across ten languages, and found a typical gap of around 200 milliseconds, with the whole distribution centered well under half a second. The researchers found the same rhythm in every language they measured, from English to Yélî Dnye.
Two hundred milliseconds is faster than people can actually formulate a response, which means we begin planning our reply while the other person is still talking. Conversation is not ping-pong; it is overlap, prediction, and timing. That is the system a translator has to fit into.
Where do the seconds go?
We walked through the full pipeline in how AirPods translation actually works. From a latency standpoint, each stage takes its bite:
- Endpointing. The app waits for enough silence to be sure you are done. Roughly 250 ms, and that margin is hard to shrink: trim it too far and the app interrupts people mid-thought.
- Speech to text. 150 to 300 ms streaming in the cloud, 300 to 600 ms on-device.
- Translation. 50 to 150 ms on-device, 600 to 1,100 ms through a large language model.
- Speech synthesis. Measured to the moment the first audio plays: 200 to 400 ms for local voices, 400 to 900 ms for premium cloud voices.
Stack the slow end of every range and you are close to three seconds. Stack the fast end and you are well under one. The numbers we actually target and hit are published on the methodology page, and the spread between the free path and the all-cloud path, roughly 1.0 to 1.5 seconds versus 1.5 to 2.5, is the sum of those choices.
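As a rough check on the two paths, here is a minimal Python sketch that adds up the ranges from the list above. It assumes the stages run strictly one after another and ignores network round trips and any overlap between stages. The dictionary keys and the `budget` helper are illustrative names, not part of any real app.

```python
# Stage latency ranges in milliseconds (low, high), copied from the list above.
ENDPOINTING = (250, 250)
STT = {"cloud": (150, 300), "on_device": (300, 600)}
TRANSLATION = {"on_device": (50, 150), "llm": (600, 1100)}
TTS = {"local": (200, 400), "cloud": (400, 900)}


def budget(*stages):
    """Sum (low, high) ranges, assuming the stages run back to back."""
    return (sum(s[0] for s in stages), sum(s[1] for s in stages))


# Free path: every stage on the device.
on_device_path = budget(ENDPOINTING, STT["on_device"], TRANSLATION["on_device"], TTS["local"])
# All-cloud path: cloud transcription, LLM translation, premium cloud voice.
all_cloud_path = budget(ENDPOINTING, STT["cloud"], TRANSLATION["llm"], TTS["cloud"])

print(on_device_path)  # (800, 1400)
print(all_cloud_path)  # (1400, 2550)
```

The naive sums land close to the rounded figures quoted above, which is all this kind of back-of-the-envelope budget is meant to show.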
Why streaming is the whole game
The single biggest latency decision in a translation pipeline is whether any stage waits for a complete unit of work before starting the next. A pipeline that transcribes the full utterance, then translates the full text, then synthesizes the full audio pays every stage’s worst case in sequence.
A streaming pipeline overlaps them. Transcription emits words while the speaker is still finishing. Synthesis starts playing the first chunk of audio while the rest is still being generated. The stages and the models are the same, yet the perceived wait drops by half or more, because the listener starts hearing the translation before the pipeline has finished producing it.
This is also why “supports streaming” is worth more in a spec sheet than a tenth of a percent of accuracy. Any stage that buffers a whole turn blows the budget on its own.
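The difference can be sketched with a toy model in Python. Each stage gets two numbers: how long until it emits its first chunk, and how long until it has emitted everything. All the figures below are invented for illustration, and the model is deliberately simplified: it assumes a streaming stage can start as soon as the previous stage emits its first chunk.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    first_output_ms: int  # delay before the stage emits its first chunk
    full_output_ms: int   # delay before the stage has emitted everything


def batch_first_audio(stages):
    """Every stage waits for the previous one to finish completely."""
    return sum(s.full_output_ms for s in stages)


def streaming_first_audio(stages):
    """Every stage starts on the previous stage's first chunk."""
    return sum(s.first_output_ms for s in stages)


# Hypothetical numbers, chosen only to make the contrast visible.
stages = [
    Stage("speech to text", first_output_ms=100, full_output_ms=400),
    Stage("translation", first_output_ms=150, full_output_ms=900),
    Stage("speech synthesis", first_output_ms=200, full_output_ms=1200),
]

print(batch_first_audio(stages))      # 2500
print(streaming_first_audio(stages))  # 450
```

In this model a single stage that cannot stream contributes its full duration instead of its first-chunk delay, which is how one buffering stage can dominate the total.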
How much delay can a conversation absorb?
More than 200 milliseconds, fortunately. The realistic benchmark is not unmediated conversation but interpreted conversation. When a human interpreter works consecutively, the speaker pauses, the interpreter renders the sentence, and everyone accepts the cadence because the rhythm stays predictable. The conversation slows, but it stays a conversation.
That is the window a translation app has to land in. In our experience the boundaries are roughly these:
- Under a second. Feels eager. People stop noticing the tool.
- One to two seconds. Feels like a polite interpreter. Eye contact survives, turn-taking survives. This is the conversational zone.
- Three to four seconds. People start repeating themselves, talking over the translation, or reaching for the phone screen to see what went wrong.
- Five seconds and beyond. Turn-taking collapses. The exchange becomes alternating dictation, which is the walkie-talkie feeling that made older translation apps exhausting.
The killer above two seconds is not the wait itself but the doubt. A predictable 1.5-second delay becomes rhythm; an unpredictable two-to-five-second delay means neither person knows when to speak, and that uncertainty is what breaks the flow.
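One way to make that distinction concrete is to look at the spread of per-turn delays as well as the typical delay. The Python sketch below is an illustration under assumptions: the delay lists are made up, and using max minus min as the spread is just one simple choice of measure.

```python
import statistics


def describe(delays_ms):
    """Return (median, spread) for a list of per-turn delays in milliseconds."""
    return statistics.median(delays_ms), max(delays_ms) - min(delays_ms)


# Hypothetical sessions: one steady, one erratic.
steady = [1500, 1400, 1600, 1500]
erratic = [2000, 5000, 3100, 4200]

print(describe(steady))   # (1500.0, 200)
print(describe(erratic))  # (3650.0, 3000)
```

The steady session varies by a fifth of a second from turn to turn; the erratic one varies by three full seconds, which is what leaves both speakers guessing when to talk.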
What this means when you choose a tool
Two takeaways. First, compare latency claims at the same point: the time from when the speaker stops to when the listener hears audio begin. Marketing numbers sometimes measure to when text appears on a screen, which leaves out speech synthesis entirely.
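To see how far apart the two measurements can be, here is a small Python sketch. The `TurnTimestamps` record and its field names are hypothetical, as are the example timestamps; the point is only that both numbers start at the same moment and stop at different ones.

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    speech_end_ms: int   # speaker stops talking
    text_shown_ms: int   # translated text appears on screen
    audio_start_ms: int  # listener hears the first translated audio


def latency_to_text(t):
    """The number a spec sheet might quote."""
    return t.text_shown_ms - t.speech_end_ms


def latency_to_audio(t):
    """The number the listener actually experiences."""
    return t.audio_start_ms - t.speech_end_ms


# Hypothetical turn, timestamps in milliseconds since the session started.
turn = TurnTimestamps(speech_end_ms=10_000, text_shown_ms=10_900, audio_start_ms=11_500)

print(latency_to_text(turn))   # 900
print(latency_to_audio(turn))  # 1500
```

Two tools quoting "900 ms" and "1.5 s" could be describing the same turn, so it is worth asking which endpoint a claim uses.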
Second, expect a trade. Cloud stages buy fluency and voice quality with seconds; on-device stages buy speed and privacy with some polish. We built metcha so that trade is yours to make per session, and so the default, the free on-device path, sits comfortably inside the conversational zone. The full walkthrough of what that feels like in practice is at how it works.