How AirPods translation actually works
Two people sit across a table. One wears the left earbud, the other wears the right. They talk, each in their own language, and each hears the other in a language they understand. From the outside it looks like one device doing one trick. Under the hood it is four distinct systems passing work to each other several times a second, and understanding the relay explains almost everything about why these apps behave the way they do.
The earbuds are the simplest part
Start with the thing the product is named after. AirPods, and earbuds generally, do no translation. They contribute exactly two capabilities: a microphone near each mouth, and a speaker in each ear that the phone can address separately. The left channel can carry one language while the right channel carries another, which is what lets two people share a pair.
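To make the per-ear routing concrete, here is a minimal sketch that interleaves two mono sample streams into one stereo stream, so the left ear and the right ear receive different audio. The function name and the plain-list sample format are assumptions for illustration; a real app would hand this job to the platform's audio APIs.

```python
from itertools import zip_longest


def interleave_stereo(left, right, silence=0):
    """Merge two mono sample sequences into interleaved stereo [L, R, L, R, ...].

    The shorter stream is padded with `silence` so both channels stay the
    same length and remain in sync.
    """
    stereo = []
    for left_sample, right_sample in zip_longest(left, right, fillvalue=silence):
        stereo.extend((left_sample, right_sample))
    return stereo


# Hypothetical example: three samples of one language on the left,
# two samples of the other language on the right.
print(interleave_stereo([1, 2, 3], [9, 8]))  # [1, 9, 2, 8, 3, 0]
```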
Everything else happens on the phone, or on servers the phone talks to. This is worth knowing before you buy anything, because it means the earbuds you already own are usually sufficient. The intelligence is in the software.
Stage one: capture
The app listens continuously and decides where one utterance ends and the next begins. This is called endpointing, and it is harder than it sounds. End a segment too eagerly and you translate half-sentences. Wait too long and the listener sits through dead air after the speaker has clearly finished.
Good systems use voice-activity detection tuned for natural pauses, which is why you can just talk instead of holding a button. The cost of this comfort is a built-in delay of roughly a quarter second: the app has to hear enough silence to be confident you are done.
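The trade-off can be sketched with a toy energy-based endpointer. This is not how any particular app does it: the energy threshold, the 20 ms frame size, and the function name are all assumptions, and production voice-activity detection is considerably more sophisticated. The 250 ms silence requirement mirrors the quarter-second figure above.

```python
def find_endpoints(frame_energies, frame_ms=20, threshold=0.1, hangover_ms=250):
    """Return (start_frame, end_frame) pairs, end exclusive, one per utterance.

    A frame counts as speech when its energy exceeds `threshold`. An utterance
    is closed only after `hangover_ms` of continuous silence, so short pauses
    inside a sentence do not split it.
    """
    hangover_frames = -(-hangover_ms // frame_ms)  # ceiling division
    segments = []
    start = None
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy > threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                # Close the segment at the first silent frame of this run.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Audio ended before the hangover elapsed: close at the last speech frame.
        segments.append((start, len(frame_energies) - silent_run))
    return segments


# 10 speech frames, a 60 ms pause, 4 more speech frames, then long silence.
energies = [0] * 5 + [1] * 10 + [0] * 3 + [1] * 4 + [0] * 20
print(find_endpoints(energies))                  # [(5, 22)]
print(find_endpoints(energies, hangover_ms=40))  # [(5, 15), (18, 22)]
```

With the 250 ms hangover the short pause is absorbed into one utterance. With an eager 40 ms hangover the same audio is cut into two fragments, which is the half-sentence problem described above.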
Stage two: speech to text
The captured audio becomes text. There are two ways to do this, and the choice shapes the whole product.
On-device recognition uses something like Apple’s Speech framework. Nothing leaves the phone, it works with no signal, and it costs nothing per use. It is good with clear speech and weaker in noisy rooms.
Cloud recognition streams the audio to a service like Deepgram, which returns text in near real time. It handles accents, background noise, and overlapping voices better, at the price of a network round trip and a privacy decision that should be made explicit, not buried.
In metcha, the free tier uses the on-device path and metcha Plus can route through Deepgram. Either way this stage takes roughly 150 to 600 milliseconds per turn.
Stage three: translation
The text moves between languages. On-device translation, like Apple’s Translation framework, downloads language packs once and then runs locally in 50 to 150 milliseconds. The output is competent and a little literal. It can miss idioms.
The alternative is a large language model in the cloud, which sees the sentence in conversational context and produces something a person would actually say. metcha’s optional Better Translation path uses Anthropic’s Claude for this. It reads better and costs more time, around 600 to 1,100 milliseconds.
A useful rule: literal translation is fine for transactions and noticeably worse for warmth, jokes, and anything indirect. Which one matters depends on the conversation you are having.
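One way to keep a slow cloud call from stalling the conversation is to give it a deadline and fall back to the local translator. The sketch below shows that pattern under stated assumptions: `cloud_translate` and `local_translate` are hypothetical callables standing in for the two paths, and the 1.5 second timeout is an illustrative value, not a documented one.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_with_fallback(text, cloud_translate, local_translate, timeout_s=1.5):
    """Try the cloud translator first; use the local one on timeout or error.

    Returns a (translation, path) tuple where path is "cloud" or "local".
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_translate, text)
    try:
        return future.result(timeout=timeout_s), "cloud"
    except Exception:
        # Covers both a timeout and any error raised by the cloud call.
        return local_translate(text), "local"
    finally:
        # Do not block on a cloud call that is still in flight.
        pool.shutdown(wait=False)


def failing_cloud(text):
    # Hypothetical stand-in for a cloud service that is unreachable.
    raise ConnectionError("no signal")


print(translate_with_fallback("hello", failing_cloud, lambda t: "local:" + t))
# ('local:hello', 'local')
```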
Stage four: text to speech
Finally the translated text is spoken into the listener’s ear, and only the listener’s ear. The speaker does not hear their own translation, both because it is disorienting and because it would leak back into the microphone.
iOS ships built-in voices that run locally. Premium cloud voices, like the ElevenLabs catalog metcha Plus uses, sound markedly more human and can even match the speaker’s accent, so the English you hear from a Japanese speaker carries a Japanese accent, the way it would in person.
How fast is it end to end?
Adding the stages up: the free, fully on-device path lands around 1.0 to 1.5 seconds from the moment a speaker finishes to the moment the listener hears the translation. With every cloud stage active it is more like 1.5 to 2.5 seconds. The complete stage-by-stage budget is on our methodology page, including the cases where we fall back from cloud to local.
Those numbers sit inside the window where an exchange still feels like conversation. They are also, frankly, slower than talking without a translator, and any product that implies otherwise is setting you up for disappointment.
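As a sanity check, the per-stage figures quoted earlier can be added up. The sketch below sums only the stages this article gives numbers for (endpointing, speech to text, translation). Speech synthesis has no quoted figure here, so it is left out, and the gap between these sums and the end-to-end totals is presumably synthesis plus overhead. Treating the quarter-second endpointing delay as a fixed 250 ms is a simplification.

```python
# Per-stage latency ranges in milliseconds (low, high), from the figures above.
ENDPOINTING = (250, 250)        # "roughly a quarter second", treated as fixed
SPEECH_TO_TEXT = (150, 600)
TRANSLATION_LOCAL = (50, 150)
TRANSLATION_CLOUD = (600, 1100)


def sum_ranges(*ranges):
    """Add (low, high) ranges component-wise."""
    return (sum(low for low, _ in ranges), sum(high for _, high in ranges))


on_device = sum_ranges(ENDPOINTING, SPEECH_TO_TEXT, TRANSLATION_LOCAL)
cloud = sum_ranges(ENDPOINTING, SPEECH_TO_TEXT, TRANSLATION_CLOUD)

print(on_device)  # (450, 1000)
print(cloud)      # (1000, 1950)
```

Both partial sums fall below the quoted end-to-end ranges, which is consistent with the final speech stage taking up the rest.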
The two hard problems nobody mentions
First, the phone hears one audio stream, not two. iOS presents a pair of earbuds as a single logical microphone, so the app cannot know which earbud picked up a voice. The way out is elegant: the session is locked to two languages, so detecting the language of an utterance identifies the speaker. Japanese in, must be them; English in, must be you.
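The language-identifies-speaker trick fits in a few lines. This is a sketch, not the app's actual code: the function name, the language codes, and the "me"/"them" labels are assumptions for illustration, and the language detection itself is assumed to happen upstream.

```python
def route_utterance(detected_lang, my_lang, their_lang):
    """Infer who spoke from the detected language of an utterance.

    Returns (speaker, target_lang, listener), or None when the detected
    language is outside the two-language session.
    """
    if detected_lang == my_lang:
        return ("me", their_lang, "them")
    if detected_lang == their_lang:
        return ("them", my_lang, "me")
    return None  # Not one of the two locked languages: ignore the utterance.


# Session locked to English (me) and Japanese (them).
print(route_utterance("ja", my_lang="en", their_lang="ja"))  # ('them', 'en', 'me')
print(route_utterance("en", my_lang="en", their_lang="ja"))  # ('me', 'ja', 'them')
print(route_utterance("fr", my_lang="en", their_lang="ja"))  # None
```

Locking the session to exactly two languages is what makes the inference unambiguous. It also means the trick cannot work when both people speak the same language.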
Second, the conversation is half-duplex. One utterance is processed at a time, which assumes two people who take turns. Real face-to-face conversation works this way naturally, which is why the assumption holds far better than it sounds like it should.
What this means in practice
The pipeline rewards seated, turn-taking conversation: a meal, a front desk, a long taxi ride, a relative you have never been able to talk to. It is weakest in shouting-distance, walking-around situations, which is fine, because those are mostly solved by pointing a camera at a sign.
If you want the user’s-eye view of the same pipeline, the How it works page walks through a session step by step. The languages page lists which pairs are supported on which path, and the pricing page covers what the free tier includes, which is more than most people expect.