Fully on-device · open source

On-device speech to text for iOS and Android, from one Kotlin core.

Hand Earshot an audio or video file and it gives you back the transcript, running entirely on the phone. No server, no per-user cost, no audio leaving the device. The only network call in the whole pipeline is the one-time model download on first run.

Audio never leaves the phone · No per-user inference cost · MIT licensed

How it works

Same model family, two on-device runtimes, one shared Kotlin core.

Earshot is the integration layer around an off-the-shelf model: getting the model onto the device, extracting clean 16 kHz audio from whatever file you started with, running the model inside a phone's memory and battery budget, and exposing one API that behaves the same on both platforms.

01 · Input file
An audio or video file from anywhere on the device.

02 · Audio extraction
Decode to clean 16 kHz mono PCM, on-device, per platform.

03 · Whisper runtime
WhisperKit on iOS, ONNX Runtime on Android. Same model family.

04 · Transcript
Text, language, confidence, and timing returned to you.
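To make the "16 kHz mono PCM" target of stage 02 concrete, here is a minimal sketch of the two transformations involved: downmixing interleaved channels to mono and resampling to 16 kHz. It is an illustration only. Earshot delegates decoding to AVFoundation and MediaCodec, and the function names `downmixToMono` and `resampleLinear` are hypothetical, not part of the library. The resampler uses plain linear interpolation, where a production pipeline would typically low-pass filter first.

```kotlin
import kotlin.math.roundToInt

/** Average interleaved 16-bit channels into a single mono channel. */
fun downmixToMono(interleaved: ShortArray, channels: Int): ShortArray {
    require(channels > 0) { "channels must be positive" }
    val frames = interleaved.size / channels
    return ShortArray(frames) { f ->
        var sum = 0
        for (c in 0 until channels) sum += interleaved[f * channels + c]
        (sum / channels).toShort()
    }
}

/**
 * Naive linear-interpolation resampler (assumption: good enough to show the idea,
 * not what the platform decoders actually do).
 */
fun resampleLinear(input: ShortArray, fromRate: Int, toRate: Int = 16_000): ShortArray {
    require(fromRate > 0 && toRate > 0) { "sample rates must be positive" }
    if (input.isEmpty() || fromRate == toRate) return input.copyOf()
    val outLength = (input.size.toLong() * toRate / fromRate).toInt()
    val step = fromRate.toDouble() / toRate
    return ShortArray(outLength) { i ->
        val pos = i * step
        val left = pos.toInt().coerceAtMost(input.size - 1)
        val right = (left + 1).coerceAtMost(input.size - 1)
        val frac = pos - left
        // Interpolate between the two nearest input samples.
        (input[left] + (input[right] - input[left]) * frac).roundToInt().toShort()
    }
}
```

One second of 48 kHz stereo audio would pass through `downmixToMono(samples, 2)` and then `resampleLinear(mono, 48_000)`, yielding 16,000 mono samples.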

Per-platform implementation of each pipeline stage

| Stage | iOS | Android |
| --- | --- | --- |
| Audio extraction | AVFoundation (AVAssetReader) | MediaExtractor + MediaCodec |
| ASR runtime | WhisperKit (CoreML) | ONNX Runtime + Extensions |
| Model | Whisper (CoreML, fetched + cached by WhisperKit) | Whisper (Olive ONNX, int8) |
| Model download | WhisperKit, internal | ModelDownloader (plain HTTPS) |

The cross-platform contract lives in commonMain as expect classes (AudioExtractor, TranscriptionEngine, ModelDownloader) with platform actual implementations. OnDeviceTranscriber is a thin facade that wires the extractor and engine together.
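The shape of that contract can be sketched roughly as follows. This is a simplified, single-file approximation. The real project uses `expect` classes in commonMain with `actual` implementations per platform, and its API is suspending. Here the two dependencies are modeled as plain interfaces so the example compiles on its own. The method names `extract` and `transcribe`, the `TranscriberFacade` class, the fakes, and the `Long` type of `processingTimeMs` are all assumptions made for illustration.

```kotlin
// Result shape mirrors the Success/Error cases used in the quickstart.
sealed class TranscriptionEngineResult {
    data class Success(val text: String, val processingTimeMs: Long) : TranscriptionEngineResult()
    data class Error(val message: String) : TranscriptionEngineResult()
}

// Stand-in for the platform-specific extractor (hypothetical method name).
interface AudioExtractor {
    /** Writes a 16 kHz mono PCM WAV to [wavPath]; returns false on failure. */
    fun extract(mediaPath: String, wavPath: String): Boolean
}

// Stand-in for the platform-specific Whisper runtime (hypothetical method name).
interface TranscriptionEngine {
    fun transcribe(wavPath: String): TranscriptionEngineResult
}

/** A thin facade: extract audio first, then hand the WAV to the engine. */
class TranscriberFacade(
    private val engine: TranscriptionEngine,
    private val extractor: AudioExtractor,
) {
    fun transcribeMedia(mediaPath: String, scratchWavPath: String): TranscriptionEngineResult {
        if (!extractor.extract(mediaPath, scratchWavPath)) {
            return TranscriptionEngineResult.Error("audio extraction failed for $mediaPath")
        }
        return engine.transcribe(scratchWavPath)
    }
}

// Fakes that show how either dependency can be swapped, for example in tests.
class FakeExtractor(private val succeed: Boolean) : AudioExtractor {
    override fun extract(mediaPath: String, wavPath: String): Boolean = succeed
}

class FakeEngine : TranscriptionEngine {
    override fun transcribe(wavPath: String): TranscriptionEngineResult =
        TranscriptionEngineResult.Success("hello from $wavPath", 0L)
}
```

Injecting both dependencies keeps the facade free of platform code, which is what lets the same wiring run on iOS and Android.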

Quickstart

Pick a file, transcribe on-device, read the text back.

The whole surface is one facade. Each platform builds the engine and extractor its own way, then hands both to OnDeviceTranscriber. Here is the real wiring from the Android sample app.

A minimal Compose flow: make sure the Whisper model is present, wire the engine and extractor, then transcribe a media file on-device.

MainActivity.kt
// Make sure the Whisper model is present in app-private storage.
val modelsDir = File(context.filesDir, "models")
val downloader = ModelDownloader(modelsDir)
downloader.downloadModelSync(WhisperModels.WHISPER_TINY_EN)

// Point the engine at the downloaded model, then wire it to the extractor.
val engine = TranscriptionEngine().apply {
    setModelPath(File(modelsDir, WhisperModels.WHISPER_TINY_EN.name).absolutePath)
}
val transcriber = OnDeviceTranscriber(engine, AudioExtractor())
// prepare() and transcribeMedia() are suspend functions: call them from a coroutine.
transcriber.prepare()

// Extract audio to a scratch WAV in the cache directory, then transcribe it.
when (val r = transcriber.transcribeMedia(videoPath, "${context.cacheDir}/clip.wav")) {
    is TranscriptionEngineResult.Success -> println("${r.text} (${r.processingTimeMs}ms)")
    is TranscriptionEngineResult.Error -> println("failed: ${r.message}")
}
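The "make sure the model is present" step amounts to download-once caching. The sketch below shows one way that idea could look. It is not Earshot's ModelDownloader. The function `ensureModel`, the `.part` suffix, and the injected `fetch` lambda (standing in for the HTTPS download) are hypothetical.

```kotlin
import java.io.File

/**
 * Returns the cached model file, calling [fetch] only when it is missing.
 * [fetch] is a placeholder for the real network download (assumption).
 */
fun ensureModel(modelsDir: File, modelName: String, fetch: () -> ByteArray): File {
    val target = File(modelsDir, modelName)
    if (target.isFile && target.length() > 0) return target // already cached, no network

    modelsDir.mkdirs()
    // Write to a temporary file first so an interrupted download never leaves
    // a half-written file that looks like a valid model.
    val partial = File(modelsDir, "$modelName.part")
    partial.writeBytes(fetch())
    check(partial.renameTo(target)) { "could not move ${partial.name} into place" }
    return target
}
```

Calling `ensureModel` twice with the same directory and name invokes `fetch` only the first time, which matches the single network call described at the top of this page.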

API surface

One small, honest facade.

OnDeviceTranscriber(engine, audioExtractor) is the whole public surface. Both dependencies are injected because each platform builds them differently.

suspend fun prepare(config)
    Load the model into memory so transcription can start. Call once before the first transcription. Returns false if the model is not present or fails to load.
fun isReady()
    True once a model is loaded and ready to transcribe.
fun modelStatus()
    Current model status: not downloaded, downloading, ready, or error.
suspend fun transcribeAudio(wavPath)
    Transcribe a 16 kHz mono PCM WAV file already on the device.
suspend fun transcribeMedia(mediaPath, scratchWavPath)
    Extract audio from a video or any AV file on-device, then transcribe it.
fun release()
    Release the model and any native resources.
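Because transcribeAudio expects a 16 kHz mono PCM WAV, it can be useful to check a file's header before handing it over. The following is a minimal sketch of such a check and is not part of Earshot. The names `WavFormat`, `isWhisperReady`, `parseWavFormat`, and `minimalWavHeader` are hypothetical, and the 16-bit sample depth is an assumption, since the page only states "16 kHz mono PCM".

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

data class WavFormat(val audioFormat: Int, val channels: Int, val sampleRate: Int, val bitsPerSample: Int) {
    // audioFormat 1 means uncompressed PCM. The 16-bit depth is an assumption.
    val isWhisperReady: Boolean
        get() = audioFormat == 1 && channels == 1 && sampleRate == 16_000 && bitsPerSample == 16
}

/** Walks the RIFF chunks and reads the "fmt " chunk, or returns null if this is not a WAV. */
fun parseWavFormat(bytes: ByteArray): WavFormat? {
    if (bytes.size < 12) return null
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    fun tag(offset: Int) = String(bytes, offset, 4, Charsets.US_ASCII)
    if (tag(0) != "RIFF" || tag(8) != "WAVE") return null

    var pos = 12
    while (pos + 8 <= bytes.size) {
        val id = tag(pos)
        val size = buf.getInt(pos + 4)
        if (id == "fmt ") {
            if (size < 16 || pos + 8 + 16 > bytes.size) return null
            return WavFormat(
                audioFormat = buf.getShort(pos + 8).toInt() and 0xFFFF,
                channels = buf.getShort(pos + 10).toInt() and 0xFFFF,
                sampleRate = buf.getInt(pos + 12),
                bitsPerSample = buf.getShort(pos + 22).toInt() and 0xFFFF,
            )
        }
        if (size < 0 || size > bytes.size - pos - 8) return null // malformed or truncated chunk
        pos += 8 + size + (size and 1) // chunks are padded to an even length
    }
    return null
}

/** Builds a 44-byte PCM WAV header with an empty data chunk, for demonstration. */
fun minimalWavHeader(sampleRate: Int, channels: Int, bitsPerSample: Int = 16): ByteArray {
    val blockAlign = channels * bitsPerSample / 8
    return ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN)
        .put("RIFF".toByteArray(Charsets.US_ASCII)).putInt(36)
        .put("WAVE".toByteArray(Charsets.US_ASCII))
        .put("fmt ".toByteArray(Charsets.US_ASCII)).putInt(16)
        .putShort(1.toShort()).putShort(channels.toShort())
        .putInt(sampleRate).putInt(sampleRate * blockAlign)
        .putShort(blockAlign.toShort()).putShort(bitsPerSample.toShort())
        .put("data".toByteArray(Charsets.US_ASCII)).putInt(0)
        .array()
}
```

A file that fails this check is a candidate for transcribeMedia instead, which extracts and converts the audio itself.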

Word error rate

Lower is better · same clips, same yardstick

iOS · WhisperKit on the Neural Engine: 8.38%
Android · ONNX Runtime on a Pixel 9a: 8.98%

Whisper tiny.en on both, scored on 25 LibriSpeech clips on real hardware. See the full benchmarks.
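For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The sketch below shows that standard definition. It is not the benchmark's scoring code: the `normalizeWords` rule (lowercase, strip punctuation) is an assumption, and this page does not say how the 25 clips were normalized or aggregated.

```kotlin
/** Lowercase, replace punctuation with spaces, split on whitespace (a simple, assumed normalizer). */
fun normalizeWords(text: String): List<String> =
    text.lowercase()
        .replace(Regex("[^\\p{L}\\p{N}'\\s]"), " ")
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }

/** WER = (substitutions + deletions + insertions) / number of reference words. */
fun wordErrorRate(reference: String, hypothesis: String): Double {
    val ref = normalizeWords(reference)
    val hyp = normalizeWords(hypothesis)
    require(ref.isNotEmpty()) { "reference must contain at least one word" }

    // Levenshtein distance over words, keeping only two rows of the table.
    var prev = IntArray(hyp.size + 1) { it }
    for (i in 1..ref.size) {
        val curr = IntArray(hyp.size + 1)
        curr[0] = i
        for (j in 1..hyp.size) {
            val cost = if (ref[i - 1] == hyp[j - 1]) 0 else 1
            curr[j] = minOf(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[hyp.size].toDouble() / ref.size
}
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sat on a mat", one substitution over six reference words gives a WER of 1/6, or about 16.7%.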

Measuring it

The point of on-device is that you can prove it where it runs.

Whisper tiny.en lands at an 8.38% word error rate on an iPad's Neural Engine and 8.98% on a Pixel 9a, scored on the same clips by one yardstick. The gap comes from the runtime and its numeric precision (float16 on CoreML versus int8 on ONNX), not from the model.
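To give a feel for why int8 weights can cost a little accuracy, here is a minimal sketch of symmetric per-tensor int8 quantization and the rounding error it introduces. It is a generic textbook scheme chosen for illustration. The quantization that Olive actually applies to the Android model is not described on this page and may differ, and the function names are hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

/** Symmetric int8 quantization: each weight w is stored as q, with w ≈ scale * q and q in [-127, 127]. */
fun quantizeInt8(weights: FloatArray): Pair<ByteArray, Float> {
    val maxAbs = weights.maxOfOrNull { abs(it) } ?: 0f
    val scale = if (maxAbs == 0f) 1f else maxAbs / 127f
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return q to scale
}

/** Map the stored int8 values back to approximate float weights. */
fun dequantize(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { q[it] * scale }
```

After a round trip through `quantizeInt8` and `dequantize`, each weight is off by at most about half of `scale`, and those small per-weight errors are one plausible source of the difference between the two runtimes.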

0.02× real-time factor on the iPad: a minute of speech transcribed in about 1.2 seconds, on the device's own silicon.
These are real-device numbers, not lab estimates. Each platform ran its own on-device runtime on an actual iPad and Pixel, and every speed and memory figure says where it was measured. The method, the per-clip scores, and the full table are on the benchmarks page.
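The real-time factor quoted above is simply processing time divided by audio duration, so anything below 1.0 is faster than real time. A minimal sketch of the arithmetic, with hypothetical function names:

```kotlin
/** Real-time factor: processing time divided by audio duration. */
fun realTimeFactor(processingTimeMs: Long, audioDurationMs: Long): Double {
    require(audioDurationMs > 0) { "audio duration must be positive" }
    return processingTimeMs.toDouble() / audioDurationMs
}

/** Seconds of compute needed for [audioSeconds] of speech at a given real-time factor. */
fun processingSeconds(audioSeconds: Double, rtf: Double): Double = audioSeconds * rtf
```

At a factor of 0.02, `processingSeconds(60.0, 0.02)` comes to roughly 1.2 seconds for a minute of audio. The `processingTimeMs` value returned with a successful transcription could feed `realTimeFactor` directly, provided the clip's duration is known.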

What this is, and what it is not.

I did not train a speech model. Whisper is OpenAI's. On iOS the Whisper runtime is WhisperKit by Argmax. On Android the model is a Microsoft Olive export of Whisper running on ONNX Runtime.

Earshot is the part around the model: getting it onto the device, extracting clean 16 kHz audio from whatever file you started with, running the model inside a phone's memory and battery budget, and exposing one API that behaves the same on both platforms. That integration layer is the hard, unglamorous part of shipping a model onto someone's phone, and it is the part this library is about.

Models and licenses

Credit where it is due.

Earshot's own code is MIT. The speech models and runtimes it loads are third-party and carry their own licenses. Earshot does not redistribute model weights; they are fetched on-device at runtime from the sources below.

License: Earshot's own code is MIT. See MODELS.md for each model, its source, and its license, and LICENSE for the full text.

Speech to text that never leaves the phone.

Transcription is real and working on both platforms today. Audio extraction, the Whisper runtimes, and model download all run on-device.