How one microphone tells two speakers apart

By Justin Tormey · June 19, 2026 · Methodology

Here is a reasonable assumption almost everyone makes about earbud translation: the left earbud hears you, the right earbud hears them, and the app routes accordingly. It is a sensible design. On iOS it is also impossible, and the workaround turns out to be one of the more interesting pieces of engineering in the product.

The constraint nobody expects

iOS does not give apps separate audio streams from the left and right earbuds. The system presents the pair as a single logical microphone, mixes or auto-selects between the physical mics as it sees fit, and hands the app one mono stream. Which earbud picked up a given voice is information the app simply never receives.

So when two people share a pair, everything both of them say arrives in one undifferentiated channel. The app’s first job, before translating anything, is answering a question it has no direct evidence for: who just spoke?

Language as identity

The answer comes from a constraint the app does control. A metcha session is configured for exactly two languages before anyone speaks, say English and Japanese. That transforms the problem. If an utterance is in Japanese, it came from the Japanese speaker. If it is in English, it came from you. Detecting the language identifies the person, and identifying the person determines everything downstream: which direction to translate, and which ear hears the result.

This is why the setup asks you to choose the pair explicitly instead of guessing. The pair is not just a translation setting; it is the entire speaker-attribution system.
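
To make the idea concrete, here is a minimal sketch of how a detected language could map to a speaker, a translation direction, and a listener. The type and function names (Speaker, LanguagePair, Route, route) are hypothetical and chosen for illustration; they are not taken from the app.

```swift
// Hypothetical types for illustration only.
enum Speaker { case owner, guest }

struct LanguagePair {
    let ownerLanguage: String   // e.g. "en"
    let guestLanguage: String   // e.g. "ja"
}

struct Route: Equatable {
    let speaker: Speaker        // who is assumed to have spoken
    let sourceLanguage: String  // language of the utterance
    let targetLanguage: String  // language to translate into
    let listener: Speaker       // whose ear hears the translation
}

/// Maps a detected language code to a route.
/// Returns nil when the code is outside the session's pair.
func route(forDetected language: String, in pair: LanguagePair) -> Route? {
    switch language {
    case pair.ownerLanguage:
        return Route(speaker: .owner,
                     sourceLanguage: pair.ownerLanguage,
                     targetLanguage: pair.guestLanguage,
                     listener: .guest)
    case pair.guestLanguage:
        return Route(speaker: .guest,
                     sourceLanguage: pair.guestLanguage,
                     targetLanguage: pair.ownerLanguage,
                     listener: .owner)
    default:
        return nil
    }
}
```

With a pair of "en" and "ja", a detected "ja" yields a route from the guest to the owner, and anything outside the pair yields no route at all.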

Why detection runs twice

Language detection sounds trivial until you watch it fail. metcha runs two independent detectors on every utterance, because each one fails differently.

The first signal comes from the speech recognizer itself, which detects language from the audio. Its weakness is inertia: after thirty seconds of Japanese, it expects more Japanese, and the first short English utterance after a speaker switch can get mislabeled. It also occasionally tags a two-syllable fragment as some third language that is not in the session at all.

The second detector is Apple’s NLLanguageRecognizer, which runs on the transcribed text and, crucially, can be constrained to consider only the session’s two languages. It has no audio history to bias it and no way to answer outside the pair. On a speaker switch it resolves instantly, because it only ever sees the words.

The text-side detector is the one that makes routing decisions. The audio-side label is shown in the interface as a secondary signal, but it is not load-bearing. Two views of the same utterance, and the one without momentum gets the vote.
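
For readers who want to see what the constrained text-side check can look like, here is a minimal sketch using NLLanguageRecognizer from Apple's NaturalLanguage framework. The wrapper function detectLanguage and its parameters are assumptions made for this example, not the app's actual code.

```swift
import NaturalLanguage

/// Returns the more likely of the two session languages for a transcript,
/// or nil when the recognizer has nothing to go on (for example, empty text).
/// `detectLanguage` is a hypothetical helper name.
func detectLanguage(of transcript: String,
                    within pair: (NLLanguage, NLLanguage)) -> NLLanguage? {
    // A fresh recognizer per utterance carries no history from earlier turns.
    let recognizer = NLLanguageRecognizer()
    // Restrict the answer space to the session's two languages.
    recognizer.languageConstraints = [pair.0, pair.1]
    recognizer.processString(transcript)
    return recognizer.dominantLanguage
}
```

Calling detectLanguage(of: transcript, within: (.english, .japanese)) can only ever answer with one of those two languages. Creating a new recognizer for each utterance is one simple way to keep earlier turns from influencing the result.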

What still goes wrong

Honesty section. Language detection has a floor, and three cases sit below it.

Very short utterances are the main one. “Ok” is valid in a lot of languages. So are names, numbers, and loanwords; a Japanese speaker saying “Starbucks” gives the detector almost nothing. The mitigation is contextual: people take turns, so when an utterance is truly ambiguous, leaning toward the opposite of the previous speaker is right far more often than chance.
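
The turn-taking fallback can be sketched in a few lines. This is an illustration under stated assumptions: the detector is assumed to expose a probability for the owner's language, and the 0.2 margin is an invented threshold, not a value from the app.

```swift
// Hypothetical type for illustration only.
enum Speaker { case owner, guest }

/// Picks a speaker from the probability that the utterance is in the owner's
/// language. When that probability is too close to 0.5 to trust, and a
/// previous speaker is known, it assumes the turn passed to the other person.
/// The default margin of 0.2 is an assumed threshold.
func attribute(ownerProbability p: Double,
               previous: Speaker?,
               margin: Double = 0.2) -> Speaker {
    let confident = abs(p - 0.5) >= margin
    if confident || previous == nil {
        // Trust the detector, or fall back to it when there is no history.
        return p >= 0.5 ? .owner : .guest
    }
    // Ambiguous: lean toward the opposite of the previous speaker.
    return previous == .owner ? .guest : .owner
}
```

A confident detection always wins; the turn-taking prior only breaks ties, so it cannot override a clear signal.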

Code-switching within a sentence generally resolves to the dominant language, which is usually correct. The harder case is two consecutive utterances in the same language, when one person briefly tries the other’s tongue: the language is detected correctly, but language alone cannot reveal that a different person said it.

Because the error rate is not zero, the recovery path matters as much as the accuracy. Every turn in the transcript can be tapped to flip its attribution: the utterance is re-translated the other direction and replayed in the correct ear. One bad guess costs a tap, not the conversation. More of these practical details live in the support FAQ.
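
The tap-to-flip recovery can be modeled as a small pure function. This sketch assumes a hypothetical Turn record and takes the translator as a closure, so the names and shapes here are illustrative only; audio replay is left out.

```swift
// Hypothetical types for illustration only.
enum Speaker { case owner, guest }

struct Turn {
    let text: String          // transcribed utterance
    var speaker: Speaker      // current attribution
    var translation: String   // translation for the current attribution
}

/// Returns a copy of the turn attributed to the other speaker, re-translated
/// in the opposite direction. `translate` is a stand-in for whatever
/// translation call the app uses: it takes the text and the speaker it is
/// now attributed to.
func flipped(_ turn: Turn,
             translate: (String, Speaker) -> String) -> Turn {
    var corrected = turn
    corrected.speaker = (turn.speaker == .owner) ? .guest : .owner
    corrected.translation = translate(turn.text, corrected.speaker)
    return corrected
}
```

Flipping twice returns the original attribution, so a mistaken tap costs one more tap.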

Why not just use push-to-talk?

Push-to-talk solves attribution completely. Whoever holds the button is the speaker, no detection needed, and plenty of translation apps work this way.

The cost is the conversation itself. A button turns dialogue into transmissions, and the phone becomes an object both people stare at and pass back and forth. The entire premise of earbud translation, as we laid out in how AirPods translation actually works, is that the phone disappears and two people just talk. Automatic attribution is harder to build and occasionally wrong, and it is still the right trade, because the product exists to protect the thing push-to-talk destroys.

The languages page lists which pairs are supported today, each of which gets this same two-detector treatment.