On-device speech to text for iOS and Android, from one Kotlin core.
Hand Earshot an audio or video file and it gives you back the transcript, running entirely on the phone. No server, no per-user cost, no audio leaving the device. The only network call in the whole pipeline is the one-time model download on first run.
How it works
Same model family, two on-device runtimes, one shared Kotlin core.
Earshot is the integration layer around an off-the-shelf model: getting the model onto the device, extracting clean 16 kHz audio from whatever file you started with, running the model inside a phone's memory and battery budget, and exposing one API that behaves the same on both platforms.
Input file
An audio or video file from anywhere on the device.
Audio extraction
Decode to clean 16 kHz mono PCM, on-device, per platform.
Whisper runtime
WhisperKit on iOS, ONNX Runtime on Android. Same model family.
Transcript
Text, language, confidence, and timing returned to you.
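When a decoder returns PCM at the source file's own sample rate and channel count, getting to 16 kHz mono takes a downmix and a resample. Here is a minimal sketch of those two steps, assuming interleaved 16-bit samples. The function names are hypothetical and this is not Earshot's extractor, which relies on the platform decoders. Linear interpolation is used only because it is short; it applies no low-pass filter, so a production resampler would filter before downsampling.

```kotlin
import kotlin.math.roundToInt

// Sketch only: hypothetical helpers, not part of Earshot's API.

/** Averages interleaved multi-channel 16-bit samples into a single mono channel. */
fun downmixToMono(interleaved: ShortArray, channels: Int): ShortArray {
    require(channels > 0) { "channels must be positive" }
    val frames = interleaved.size / channels
    return ShortArray(frames) { f ->
        var sum = 0
        for (c in 0 until channels) sum += interleaved[f * channels + c]
        (sum / channels).toShort()
    }
}

/** Resamples mono PCM by linear interpolation between neighbouring samples. */
fun resampleLinear(input: ShortArray, fromHz: Int, toHz: Int): ShortArray {
    require(fromHz > 0 && toHz > 0) { "sample rates must be positive" }
    if (input.isEmpty() || fromHz == toHz) return input.copyOf()
    val outLen = (input.size.toLong() * toHz / fromHz).toInt()
    val step = fromHz.toDouble() / toHz // input samples advanced per output sample
    return ShortArray(outLen) { i ->
        val pos = i * step
        val i0 = pos.toInt().coerceAtMost(input.size - 1)
        val i1 = (i0 + 1).coerceAtMost(input.size - 1)
        val frac = pos - i0
        (input[i0] + (input[i1] - input[i0]) * frac).roundToInt().toShort()
    }
}
```

A 48 kHz stereo buffer would go through `resampleLinear(downmixToMono(pcm, 2), 48_000, 16_000)`.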
| Stage | iOS | Android |
|---|---|---|
| Audio extraction | AVFoundation (AVAssetReader) | MediaExtractor + MediaCodec |
| ASR runtime | WhisperKit (CoreML) | ONNX Runtime + Extensions |
| Model | Whisper (CoreML, fetched + cached by WhisperKit) | Whisper (Olive ONNX, int8) |
| Model download | WhisperKit, internal | ModelDownloader (plain HTTPS) |
The platform-specific pieces are declared in commonMain as expect classes
(AudioExtractor, TranscriptionEngine, ModelDownloader)
with platform actual implementations. OnDeviceTranscriber is a thin
facade that wires the extractor and engine together.
Quickstart
Pick a file, transcribe on-device, read the text back.
The whole surface is one facade. Each platform builds the engine and extractor its own
way, then hands both to OnDeviceTranscriber. Here is the real wiring from
each sample app.
A minimal Compose flow: make sure the Whisper model is present, wire the engine and extractor, then transcribe a media file on-device.
    // Make sure the Whisper model is present (one-time download on first run).
    val modelsDir = File(context.filesDir, "models")
    val downloader = ModelDownloader(modelsDir)
    downloader.downloadModelSync(WhisperModels.WHISPER_TINY_EN)

    // Point the engine at the downloaded model, then wire it to the extractor.
    val engine = TranscriptionEngine().apply {
        setModelPath(File(modelsDir, WhisperModels.WHISPER_TINY_EN.name).absolutePath)
    }
    val transcriber = OnDeviceTranscriber(engine, AudioExtractor())
    transcriber.prepare()

    // Extract audio from the video into a WAV in the cache dir, then transcribe it.
    when (val r = transcriber.transcribeMedia(videoPath, "${context.cacheDir}/clip.wav")) {
        is TranscriptionEngineResult.Success -> println("${r.text} (${r.processingTimeMs}ms)")
        is TranscriptionEngineResult.Error -> println("failed: ${r.message}")
    }
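The one-time download comes down to a check-then-fetch pattern: if the model file is already on disk, skip the network entirely. A minimal sketch of that pattern follows. It is not ModelDownloader's actual implementation; `ensureModel` and its `fetch` parameter are hypothetical, and `fetch` stands in for the HTTPS request so the example runs offline.

```kotlin
import java.io.File
import java.nio.file.Files
import java.nio.file.StandardCopyOption

/**
 * Returns the cached model file, calling `fetch` only when the file is missing.
 * Hypothetical helper for illustration; java.nio.file needs API 26+ on Android.
 */
fun ensureModel(modelsDir: File, fileName: String, fetch: () -> ByteArray): File {
    val target = File(modelsDir, fileName)
    if (target.isFile && target.length() > 0) return target // cache hit, no network

    modelsDir.mkdirs()
    // Write to a temporary name first so an interrupted download never looks complete.
    val partial = File(modelsDir, "$fileName.part")
    partial.writeBytes(fetch())
    Files.move(partial.toPath(), target.toPath(), StandardCopyOption.REPLACE_EXISTING)
    return target
}
```

With a real source the call might look like `ensureModel(modelsDir, name) { URL(modelUrl).readBytes() }`, though holding a whole model in memory is only reasonable for small files; a streaming copy would suit larger ones.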
Register the WhisperKit provider once at launch, then use the same shared API. Add
WhisperKit via Swift Package Manager; the reference Swift glue lives in
ios-support/WhisperKitTranscriptionProvider.swift.
    // at launch
    NativeTranscriptionProviderHolder.shared.implementation = WhisperKitTranscriptionProvider()

    // anywhere after
    let transcriber = OnDeviceTranscriber(engine: TranscriptionEngine(),
                                          audioExtractor: AudioExtractor())
    _ = try await transcriber.prepare(config: TranscriptionConfig())
    let result = try await transcriber.transcribeAudio(wavPath: wavPath)
API surface
One small, honest facade.
OnDeviceTranscriber(engine, audioExtractor) is the whole public surface.
Both dependencies are injected because each platform builds them differently.
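Injecting both dependencies also means the facade can be exercised without a model or a media file. The sketch below shows the shape of that idea with plain interfaces, assuming a facade that extracts first and transcribes second. Every name in it (`SketchResult`, `Extractor`, `Engine`, `SketchTranscriber`) is a hypothetical stand-in; Earshot's real types are expect/actual classes with their own signatures.

```kotlin
// Hypothetical stand-ins for illustration; not Earshot's declarations.
sealed interface SketchResult {
    data class Success(val text: String) : SketchResult
    data class Error(val message: String) : SketchResult
}

/** Extracts audio from a media file into a WAV; returns false on failure. */
fun interface Extractor {
    fun extractWav(mediaPath: String, wavPath: String): Boolean
}

/** Turns a WAV file into a transcript or an error. */
fun interface Engine {
    fun transcribe(wavPath: String): SketchResult
}

/** Thin facade: run the extractor, then hand the WAV to the engine. */
class SketchTranscriber(private val engine: Engine, private val extractor: Extractor) {
    fun transcribeMedia(mediaPath: String, wavPath: String): SketchResult =
        if (extractor.extractWav(mediaPath, wavPath)) engine.transcribe(wavPath)
        else SketchResult.Error("audio extraction failed for $mediaPath")
}
```

A test can then pass `Engine { wav -> SketchResult.Success("transcript of $wav") }` and `Extractor { _, _ -> true }` to check the wiring alone, or an extractor that returns `false` to check the error path.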
Word error rate
Lower is better · same clips, same yardstick
Whisper tiny.en on both, scored on 25 LibriSpeech clips on real hardware. See the full benchmarks.
Measuring it
The point of on-device is that you can prove it where it runs.
Whisper tiny.en lands at 8.38% word error rate (WER) on an iPad's Neural Engine and 8.98% on a Pixel 9a, scored on the same clips by one yardstick. The gap comes from the runtime and its numeric precision (float16 on CoreML versus int8 on ONNX), not from the model.
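For readers who want to reproduce the yardstick, word error rate is the word-level edit distance between the reference and the transcript (substitutions, insertions, and deletions) divided by the number of reference words. A minimal sketch follows. The text normalization here (lowercasing and stripping punctuation) is an assumption, not necessarily what the benchmark scripts do, and the function names are hypothetical.

```kotlin
/** Lowercases, strips punctuation other than apostrophes, and splits on whitespace. */
fun normalizeWords(s: String): List<String> =
    s.lowercase()
        .replace(Regex("[^a-z0-9' ]"), " ")
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }

/** Word error rate: word-level Levenshtein distance divided by reference length. */
fun wordErrorRate(reference: String, hypothesis: String): Double {
    val ref = normalizeWords(reference)
    val hyp = normalizeWords(hypothesis)
    require(ref.isNotEmpty()) { "reference must contain at least one word" }

    // prev[j] holds the distance between the first i-1 reference words and the first j hypothesis words.
    var prev = IntArray(hyp.size + 1) { it }
    for (i in 1..ref.size) {
        val curr = IntArray(hyp.size + 1)
        curr[0] = i
        for (j in 1..hyp.size) {
            val substitution = prev[j - 1] + if (ref[i - 1] == hyp[j - 1]) 0 else 1
            curr[j] = minOf(substitution, prev[j] + 1, curr[j - 1] + 1)
        }
        prev = curr
    }
    return prev[hyp.size].toDouble() / ref.size
}
```

Over a set of clips, the usual convention is to sum the edit distances and the reference word counts across all clips before dividing, rather than averaging per-clip rates, so that short clips do not dominate the score.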
What this is, and what it is not.
I did not train a speech model. Whisper is OpenAI's. On iOS the Whisper runtime is WhisperKit by Argmax. On Android the model is a Microsoft Olive export of Whisper running on ONNX Runtime.
Earshot is the part around the model: getting it onto the device, extracting clean 16 kHz audio from whatever file you started with, running the model inside a phone's memory and battery budget, and exposing one API that behaves the same on both platforms. That integration layer is the hard, unglamorous part of shipping a model onto someone's phone, and it is the part this library is about.
Models and licenses
Credit where it is due.
Earshot's own code is MIT. The speech models and runtimes it loads are third-party and carry their own licenses. Earshot does not redistribute model weights; they are fetched on-device at runtime from the sources below.
Whisper MIT
OpenAI's open speech recognition model, the shared model family across both platforms.
openai/whisper
iOS CoreML runtime
WhisperKit MIT
Argmax's CoreML runtime for Whisper, powering the iOS transcription path.
argmaxinc/WhisperKit
Android model artifact
Whisper Olive ONNX MIT
A Microsoft Olive export of Whisper, the int8 model artifact used on Android.
microsoft/onnxruntime-inference-examples
Android inference runtime
ONNX Runtime + Extensions MIT / Apache-2.0
Microsoft's cross-platform inference runtime, executing Whisper on Android.
microsoft/onnxruntime
License: Earshot and its bundled integration code are MIT; the speech models it loads carry their own licenses. See MODELS.md for each model, its source, and its license, and LICENSE for the full text.
Speech to text that never leaves the phone.
Transcription is real and working on both platforms today. Audio extraction, the Whisper runtimes, and model download all run on-device.