On-device vs cloud translation: the real trade-offs
Every translation product makes one decision that shapes all the others: where does the work run? On the phone in your pocket, or on a server somewhere else? Both answers are defensible. What is not defensible is hiding the choice from the person whose conversation is being translated.
What “on-device” actually means
On-device translation means the language models live on your phone. With Apple’s Translation framework, you download a language pack once, a few hundred megabytes per language, and from then on every translation runs locally. No request leaves the device. You can put the phone in airplane mode and the translation keeps working at exactly the same speed.
The same split exists for the other stages of the pipeline we described in “How AirPods translation actually works”: speech recognition can run on-device through Apple’s Speech framework, and the voices that speak the translation can be the ones built into iOS. A fully on-device pipeline is possible today, and it is exactly what metcha’s free tier runs.
What the cloud buys you
So why would anyone route a conversation through a server? Because context is expensive, and big models are better at it.
On-device translation is built for efficiency. It translates the sentence it is given, mostly literally. Ask it to render “it’s on the house” into Japanese and you may get a sentence about a building. A large language model in the cloud sees the utterance as part of a conversation, knows the register, and produces what a person would actually say. The difference is small for “where is the train station” and large for warmth, humor, indirectness, and anything idiomatic.
The cloud also buys better speech recognition in noisy rooms and dramatically better voices, including voices that can match the speaker’s accent. None of that is fluff; it is the difference between a translation you tolerate and one you forget is there.
The four axes
Every on-device versus cloud decision comes down to the same four trade-offs:
- Privacy. On-device, the conversation never leaves the phone. Cloud, the audio or text transits to a provider, and you are trusting their retention policy. This should be disclosed per stage, not waved at with the word “secure.”
- Latency. On-device translation takes 50 to 150 milliseconds; the LLM path takes 600 to 1,100 milliseconds. Cloud speech recognition is actually faster than local recognition once it is streaming, but the network adds variance, and variance is what breaks conversational rhythm, as we covered in the latency post.
- Cost. On-device is free forever once the packs are downloaded. Cloud stages cost real money per minute, which is why they tend to live behind subscriptions.
- Quality. Cloud wins on fluency, accents, and noise. On-device wins on consistency: it works identically in a basement, on a plane, and in a foreign country with no data plan.
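To make the latency axis concrete, here is a small arithmetic sketch using only the two translation-stage ranges quoted above. The class and function names are ours for illustration, and the numbers cover the translation stage alone, not recognition or voice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyRange:
    """A best-case/worst-case latency window, in milliseconds."""
    low_ms: int
    high_ms: int

    @property
    def spread_ms(self) -> int:
        # Width of the window: a rough stand-in for how unpredictable the stage is.
        return self.high_ms - self.low_ms


# Figures taken from the latency bullet above (translation stage only).
ON_DEVICE = LatencyRange(low_ms=50, high_ms=150)
CLOUD_LLM = LatencyRange(low_ms=600, high_ms=1100)


def extra_delay_ms(local: LatencyRange, cloud: LatencyRange) -> tuple[int, int]:
    """Smallest and largest added delay from choosing the cloud path."""
    smallest = cloud.low_ms - local.high_ms   # slow local vs fast cloud
    largest = cloud.high_ms - local.low_ms    # fast local vs slow cloud
    return smallest, largest


if __name__ == "__main__":
    print(extra_delay_ms(ON_DEVICE, CLOUD_LLM))        # (450, 1050)
    print(ON_DEVICE.spread_ms, CLOUD_LLM.spread_ms)    # 100 500
```

The cloud path adds somewhere between 450 and 1,050 milliseconds per utterance, and its window is five times wider than the local one. That wider window is the variance the latency bullet warns about.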
When on-device wins
Travel is the obvious case. The moments you most need translation (an underground metro platform, a rural bus stop, a restaurant basement) are precisely the moments connectivity disappears. A translator that needs bars is a translator that fails on exactly the wrong days.
It also wins whenever the conversation is sensitive. Money, health, family, anything you would not say in front of a stranger. With the on-device path there is no third party to trust, because there is no third party.
When cloud wins
Long, warm, high-stakes conversations in good network conditions. Meeting your partner’s grandmother, a business dinner, an interview. The idiomatic fluency and natural voices carry real weight when the relationship matters and the nuance is the point. In a quiet room with good Wi-Fi, the extra second is a fair price for translations that sound like a person.
How metcha splits the difference
We decided not to pick a side. The free tier is end-to-end on-device: Apple speech recognition, Apple translation, iOS voices, nothing leaves the phone. metcha Plus adds the cloud stages individually: Deepgram for recognition, a Claude-powered path we call Better Translation, and ElevenLabs for voices. The privacy policy names each provider and what it receives, because “AI-powered” is not a disclosure.
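One way to picture "cloud stages added individually" is a per-stage provider table. The sketch below is illustrative only and is not metcha's actual code: the stage keys, the `Provider` type, and both functions are hypothetical names, and only the provider names come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    on_device: bool  # False means data for this stage leaves the phone


STAGES = ("recognition", "translation", "voice")

# Free tier: every stage runs locally.
ON_DEVICE = {
    "recognition": Provider("Apple Speech", True),
    "translation": Provider("Apple Translation", True),
    "voice": Provider("iOS voices", True),
}

# Plus tier: each stage has a cloud alternative that can be enabled on its own.
CLOUD = {
    "recognition": Provider("Deepgram", False),
    "translation": Provider("Better Translation (Claude)", False),
    "voice": Provider("ElevenLabs", False),
}


def build_pipeline(cloud_stages=frozenset()):
    """Pick a provider per stage; anything not listed stays on-device."""
    unknown = set(cloud_stages) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    return {s: CLOUD[s] if s in cloud_stages else ON_DEVICE[s] for s in STAGES}


def disclosure(pipeline):
    """List every stage whose data leaves the device, with the provider named."""
    return [f"{stage}: {p.name}" for stage, p in pipeline.items() if not p.on_device]
```

With no cloud stages enabled, `disclosure(build_pipeline())` is an empty list, which is the free tier's whole privacy story. Enable only translation and the list has exactly one entry, naming the one provider involved.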
The part we care most about is the fallback. When the network degrades mid-conversation, metcha drops to the on-device path silently and the conversation continues, slightly less polished, never stalled. The cloud is an upgrade, not a dependency. The exact stage-by-stage details, including what each provider retains, are on the methodology page, and the pricing page spells out which stages come with which tier.
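The fallback idea can be sketched in a few lines: try the cloud path with a deadline, and use the local path if the cloud call errors or runs out of time. This is a minimal sketch under our own assumptions, not metcha's implementation. The function name, the callables passed in, and the 1.5-second default deadline are all hypothetical.

```python
import concurrent.futures as cf


def translate_with_fallback(text, cloud, local, timeout_s=1.5):
    """Return (translation, path_used).

    `cloud` and `local` are any callables that take a string and return a
    string. If `cloud` raises or does not finish within `timeout_s` seconds,
    the result comes from `local` instead, so the caller never stalls.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud, text)
    try:
        return future.result(timeout=timeout_s), "cloud"
    except Exception:
        # Covers network errors raised by `cloud` and the deadline expiring.
        return local(text), "on-device"
    finally:
        # Do not wait for a slow cloud call; let it finish in the background.
        pool.shutdown(wait=False)
```

The caller gets the same return shape either way, which is what lets a conversation continue without the speaker noticing the switch. A real pipeline would likely also track recent failures so it stops trying the cloud path on every utterance while the network is down.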