ENTRY 01 · ENGINEERING · 27 MAY 2026

Why Apple Silicon Changed Local Speech Recognition Forever

The M-series Neural Engine did not just speed up local speech recognition. It made it viable for the first time, and the implications are still unfolding.

0.0

Preface

"People who are really serious about software should make their own hardware."

Alan Kay

1.0

What Got Me Thinking

  1. 📐 Apple's "One More Thing" keynote (November 2020), where the M1 chip debuted with a 16-core Neural Engine inside a fanless MacBook Air
  2. 📊 Horace Dediu's ongoing analysis of Apple's silicon strategy on Asymco, particularly his argument that Apple treats chip design as a user experience investment, not a spec sheet exercise
  3. 🔬 The rapid progress in model quantization and distillation research, which has made it possible to run speech models that once required server GPUs on a device that fits in your lap

Five years ago, if you wanted accurate speech-to-text, you had exactly one viable option: send your voice to someone else's computer. Google's servers. Amazon's servers. Microsoft's servers. The cloud was not just the best path to accurate transcription. It was the only path.

Then Apple shipped a laptop chip with a 16-core Neural Engine capable of 11 trillion operations per second. Not for data centres. Not for a research cluster. For a MacBook Air without a fan.

That single architectural decision, putting a dedicated machine learning accelerator inside every Mac sold, changed the calculus of on-device AI speech recognition on Apple Silicon permanently. The question was no longer "can we run speech models locally?" It was "why would we run them anywhere else?"

This article is the story of that shift. What was broken before, what Apple Silicon fixed, and why the implications extend far beyond Apple's own ecosystem.

2.0

What Was Wrong With Local Speech Recognition Before 2020?

Local speech recognition is not a new idea. Dragon NaturallySpeaking shipped in 1997. Windows Speech Recognition existed for over a decade before Apple Silicon arrived. But these systems shared a common ceiling: they were CPU-bound, memory-constrained, and fundamentally limited in the models they could run.

Why did cloud speech recognition dominate for so long? The answer is straightforward. Speech recognition models are matrix multiplication machines. They take audio, convert it to spectrograms, and run those spectrograms through layers of neural network weights. Each layer involves millions of multiply-accumulate operations. CPUs can do this work, but slowly. A general-purpose CPU has a handful of cores and comparatively narrow vector units, so it works through these operations with far less parallelism than a dedicated accelerator. Running a modern speech model on an Intel Core i7 from 2019 meant either waiting several seconds for a result or running a model so small that accuracy suffered noticeably.
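To make "millions of multiply-accumulate operations" concrete, here is a minimal sketch that counts them for a single dense layer. The layer sizes are illustrative assumptions, not figures from any particular speech model.

```python
def dense_layer_macs(frames: int, in_features: int, out_features: int) -> int:
    """Multiply-accumulates needed to push `frames` input vectors through one
    fully connected layer, i.e. a (frames x in) times (in x out) matrix product."""
    return frames * in_features * out_features


def matmul_with_count(a, b):
    """Naive matrix product that also counts every multiply-accumulate."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return out, macs


# Hypothetical sizes: 100 spectrogram frames, 512 inputs, 512 outputs.
layer_macs = dense_layer_macs(frames=100, in_features=512, out_features=512)
print(f"{layer_macs:,} MACs for one layer")  # 26,214,400
```

The triple loop in `matmul_with_count` is the work a CPU has to get through a few operations at a time; an accelerator performs many of those inner multiply-accumulates in the same clock cycle.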

GPUs offered parallelism but introduced their own problems on laptops. Discrete GPUs drew significant power, generated heat, and required data to be copied between CPU memory and GPU memory, a transfer that added latency to every inference pass. Integrated GPUs were weaker. Neither was designed for the specific patterns of neural network inference.

The result was a hard trilemma. You could have fast recognition, accurate recognition, or private recognition. Pick two.

[Figure: The speech recognition trilemma before Apple Silicon. A triangle with Fast at the top vertex, Private at the bottom-left, and Accurate at the bottom-right. The edge between Fast and Accurate is highlighted as the path cloud services took. Centre text: "Pick any two, before Apple Silicon."]
01 / Speed: 11T operations per second, M1 Neural Engine
02 / Efficiency: ~1 W Neural Engine power draw during inference
03 / Scale: 38T operations per second, M4 Neural Engine
04 / Memory: 0 ms data copy overhead with unified memory

Cloud providers solved the trilemma by throwing server-grade GPUs at the problem. An NVIDIA A100 in a data centre can run a large speech model in real time without constraint. But the cost of that solution is your audio travelling over the internet to someone else's infrastructure, where it is decoded, processed, and potentially stored. Fast and accurate, yes. Private, no.

This was the state of things for years. If you cared about privacy, you accepted worse accuracy. If you cared about accuracy, you accepted the privacy trade-off. The hardware simply did not exist to do both on a consumer device.

3.0

What Did Apple Silicon Actually Change?

Apple Silicon did not just make Macs faster. It changed the architecture of what a personal computer could do with machine learning workloads. Three specific changes matter for on-device AI speech recognition on Apple Silicon.

Intel-Era Laptops

CPU-bound, thermally throttled, copy-heavy

CPU-bound inference, or a discrete GPU with high power draw and separate memory pools. Every inference pass required copying model weights across the CPU-GPU boundary. Sustained ML workloads triggered thermal throttling within minutes, making real-time dictation unreliable over longer sessions.

Apple Silicon Macs

Dedicated Neural Engine, unified memory, sustained throughput

Dedicated Neural Engine running at 11-38 TOPS at roughly one watt. Unified memory shared by CPU, GPU, and Neural Engine with zero copy overhead. Sustained inference runs indefinitely at full performance, even in a fanless chassis with no thermal throttling.

The Neural Engine Is Purpose-Built for Inference

The Neural Engine is not a GPU. It is not a CPU. It is a dedicated accelerator designed specifically for the matrix operations that neural networks require. Where a CPU handles these operations a few at a time and a GPU handles them in parallel but with overhead, the Neural Engine handles them in parallel with hardware-level acceleration tuned for the exact data patterns that ML inference produces.

The M1's 16-core Neural Engine delivered 11 trillion operations per second (TOPS). The M2 pushed that to 15.8 TOPS. The M4, shipping in current Macs, reaches 38 TOPS. To put that in context: 38 trillion operations per second is more than enough to run a large speech recognition model in real time while using a fraction of the power budget that a discrete GPU would require.
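A back-of-envelope way to read those TOPS figures is to ask what share of peak throughput a model would need to keep up with live audio. The sketch below does that arithmetic. The per-second operation count is a made-up assumption for illustration, and TOPS ratings are peak figures, so real-world utilisation will be less favourable than this suggests.

```python
def peak_utilisation(ops_per_audio_second: float, accelerator_tops: float) -> float:
    """Fraction of peak throughput needed to process audio in real time."""
    return ops_per_audio_second / (accelerator_tops * 1e12)


# TOPS ratings quoted in the article.
NEURAL_ENGINE_TOPS = {"M1": 11.0, "M2": 15.8, "M4": 38.0}

# Assumption for illustration only: the model needs 100 billion
# operations for each second of audio it transcribes.
ASSUMED_OPS_PER_AUDIO_SECOND = 100e9

for chip, tops in NEURAL_ENGINE_TOPS.items():
    share = peak_utilisation(ASSUMED_OPS_PER_AUDIO_SECOND, tops)
    print(f"{chip}: {share:.2%} of peak throughput")
```

Under that assumption even the M1 would spend under one percent of its rated peak on the model, which is the sense in which the hardware has headroom.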

What makes this different from just "a faster chip"? Dedicated silicon. The Neural Engine is not sharing resources with your web browser or your code editor. When a speech model runs on the Neural Engine, it gets dedicated hardware that does nothing else. This is why Apple can deliver inference performance that rivals what you would get from a much larger, much hotter, much more expensive discrete GPU.

Unified Memory Eliminated the Bottleneck

Before Apple Silicon, Macs (and most PCs) used separate memory pools for the CPU and GPU. Running a speech model on the GPU meant copying the model weights and audio data from CPU memory to GPU memory before inference could begin, then copying the results back. This memory transfer was a latency tax on every single inference pass.

Apple Silicon's unified memory architecture gives the CPU, GPU, and Neural Engine shared access to the same pool of memory. No copies. No transfers. No latency tax. When the Neural Engine needs to access a speech model's weights, they are already there.

For speech recognition specifically, this matters enormously. A dictation session involves hundreds or thousands of inference passes, one for each chunk of audio. If each pass incurred even a few milliseconds of memory transfer overhead, the accumulated latency would make real-time transcription feel sluggish. Unified memory removes that overhead entirely.
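The accumulation argument is simple arithmetic, sketched here. The chunk length and per-pass copy cost are hypothetical values chosen only to show the scale of the effect.

```python
import math


def accumulated_overhead_ms(session_seconds: float,
                            chunk_seconds: float,
                            copy_ms_per_pass: float) -> float:
    """Total time spent on memory copies across a whole dictation session."""
    passes = math.ceil(session_seconds / chunk_seconds)  # one pass per audio chunk
    return passes * copy_ms_per_pass


# Hypothetical session: 10 minutes of dictation, 0.5 s chunks, 3 ms copy per pass.
discrete = accumulated_overhead_ms(600, 0.5, 3.0)
unified = accumulated_overhead_ms(600, 0.5, 0.0)
print(f"separate memory pools: {discrete:.0f} ms of copying")  # 3600 ms
print(f"unified memory:        {unified:.0f} ms of copying")   # 0 ms
```

The per-pass cost matters as much as the total: every chunk of text arrives that many milliseconds later than it otherwise would.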

The Thermal Envelope Made It Sustainable

Running ML inference at high throughput generates heat. On Intel-era laptops, sustained GPU or CPU workloads would trigger thermal throttling within minutes, the chip slowing itself down to avoid overheating. This made real-time, sustained speech recognition on a laptop unreliable. The first minute might feel fast. The fifth minute would not.

Apple Silicon's power efficiency changed this equation. The Neural Engine delivers its full TOPS rating at a fraction of the power draw of a discrete GPU. The M1's Neural Engine operates at roughly one watt during active inference. This is low enough that a MacBook Air, a machine with no fan at all, can sustain real-time speech recognition indefinitely without throttling.

This is not a benchmark curiosity. It is a practical requirement for a feature that people need to use for hours at a time. A writer dictating a manuscript, a doctor dictating clinical notes, a developer narrating commits: these are sustained workloads. They need hardware that does not fade after five minutes.

4.0

How Does On-Device Speech Recognition Work on Apple Silicon?

Understanding why Apple Silicon matters requires understanding what happens when you speak and a local model turns that speech into text. The pipeline has distinct stages, each with different computational demands.

Stage 1

Audio Capture (mic input)

Raw audio is captured from the microphone at a 16 kHz sample rate. Voice activity detection identifies when speech begins and ends, filtering silence and background noise before anything reaches the model.
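As a rough illustration of voice activity detection, the sketch below flags frames whose energy exceeds a threshold. This is the simplest possible approach and an assumption on my part; production detectors are usually more robust, and the 20 ms frame length and threshold value are arbitrary choices.

```python
import math

SAMPLE_RATE = 16_000                        # Hz, matching the capture stage
FRAME_MS = 20                               # assumed frame length
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per frame


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(samples, threshold=0.02):
    """Return one boolean per full frame: True where energy suggests speech."""
    flags = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        flags.append(frame_rms(frame) >= threshold)
    return flags


# Synthetic input: 0.1 s of silence followed by 0.1 s of a 440 Hz tone.
silence = [0.0] * (SAMPLE_RATE // 10)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 10)]
flags = speech_flags(silence + tone)
print(flags)  # five False frames, then five True frames
```

Only the frames flagged True would be passed on to feature extraction, which is how silence is kept away from the model.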

Stage 2

Feature Extraction (spectrogram)

Audio waveforms are converted into mel spectrograms, a visual representation of frequency over time. This transforms raw audio into the structured input the neural network expects.
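The "mel" in mel spectrogram refers to a frequency scale that spaces bands closer together at low frequencies, where speech carries most of its detail. A minimal sketch of that spacing follows, using the widely used formula m = 2595 · log10(1 + f / 700). The band count of 80 and the 8 kHz upper limit (the Nyquist limit for 16 kHz audio) are assumptions, not details from the article.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in hertz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_centre_frequencies(n_mels: int, f_min: float, f_max: float):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centres = mel_centre_frequencies(n_mels=80, f_min=0.0, f_max=8000.0)
print(f"first band gap: {centres[1] - centres[0]:.1f} Hz")
print(f"last band gap:  {centres[-1] - centres[-2]:.1f} Hz")
```

The gap between neighbouring bands grows with frequency, so the spectrogram spends more of its resolution on the range that matters for speech.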

Stage 3

Neural Network Inference (Neural Engine)

The spectrogram passes through the speech model's encoder and decoder layers. This is the computationally heavy stage where the Neural Engine's dedicated hardware delivers its advantage. Millions of matrix operations happen in parallel, with model weights read directly from unified memory.

Stage 4

Text Decoding and Cleanup (output)

Raw token predictions are decoded into text. An on-device language model handles punctuation, capitalisation, filler word removal, and formatting, all without sending a single byte to the cloud.
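To give a feel for what the cleanup step produces, here is a toy rule-based stand-in. It is not a language model and not how any particular product does it; the filler list and the punctuation rules are assumptions made for the example.

```python
import re

FILLERS = {"um", "uh", "erm"}  # illustrative list, not exhaustive


def clean_transcript(raw: str) -> str:
    """Drop filler words, capitalise sentence starts, ensure a final full stop."""
    words = [w for w in raw.split() if w.lower().strip(",.") not in FILLERS]
    text = " ".join(words)
    # Split after sentence-ending punctuation and capitalise each sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    text = " ".join(s[:1].upper() + s[1:] for s in sentences)
    if text and text[-1] not in ".!?":
        text += "."
    return text


print(clean_transcript("um so the meeting is uh at noon"))
# So the meeting is at noon.
```

A real on-device language model handles far more than this (commas, casing of names, formatting), but the input and output have the same shape: raw decoded words in, readable text out.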


Stage 3 is where Apple Silicon earns its keep. Neural network inference involves running the spectrogram through hundreds of millions of model parameters, each requiring multiply-accumulate operations. On a CPU, this stage is the bottleneck. On the Neural Engine, it runs fast enough that the user perceives transcription as instantaneous.

The entire pipeline, from microphone to finished text, happens in milliseconds. No network round trip. No server queue. No variable latency depending on how many other users are hitting the same endpoint. The speed is deterministic because the hardware is local and dedicated.

For a deeper look at each stage, see How Speech Recognition Works: Components and Architecture.

5.0

Why Does Local Processing Matter for Privacy?

The privacy argument for on-device AI speech recognition is not abstract. It is architectural.

When you dictate to a cloud service, your audio travels from your microphone to a remote server. That server decodes your speech, generates text, and sends it back. During this process, your raw audio exists on infrastructure you do not control. It may be logged for quality assurance. It may be retained for model training. It may be subject to a jurisdiction's data access laws. Even if the provider promises deletion, you have no way to verify it.

When speech recognition runs entirely on-device, the audio never leaves your hardware. There is no server to log it, no network to intercept it, no third party with access to it. The privacy guarantee is not a policy decision that a company can reverse with a terms-of-service update. It is a physical constraint enforced by the architecture itself.

Cloud Speech Recognition

Your voice travels to someone else's server

Audio is uploaded, processed remotely, and may be stored, logged, or used for model training. Privacy depends on corporate policy, not physics. Latency varies with network conditions and server load.

On-Device Recognition (Apple Silicon)

Your voice never leaves your machine

Audio is captured, processed, and discarded on local hardware. No network, no server, no third-party access. Latency is deterministic and measured in milliseconds. Privacy is a physical guarantee.

This distinction matters most for professionals handling sensitive material. Lawyers dictating privileged communications. Doctors recording clinical observations. Executives composing confidential strategy memos. For these users, the question is not whether cloud speech recognition is convenient. It is whether their professional obligations even permit sending voice data to a third party. In many regulated contexts, the answer is no.

Apple Silicon made this a solved problem. The hardware is fast enough, efficient enough, and accurate enough that choosing local processing no longer means choosing a worse experience. For an in-depth look at what Apple's own dictation service actually transmits, see Apple Dictation Privacy: What Data It Sends and How to Stop It.

6.0

How Does Yaps Take Advantage of Apple Silicon?

Yaps is designed from the ground up for on-device AI speech recognition. On Android, where Yaps ships as a full IME keyboard, all dictation runs locally through the device's neural processing hardware. On macOS, Yaps is built with Tauri v2 and Rust, a native architecture that gives it direct access to Apple Silicon's Neural Engine through Core ML.

Why does the implementation stack matter? Because not all "local" speech apps are created equal. An Electron-based app wrapping a Python script does not get the same access to the Neural Engine that a native Rust binary does. Electron adds a Chromium runtime, JavaScript overhead, and an abstraction layer between the app and the hardware. Yaps skips all of that. Native code, native frameworks, direct hardware access.

The practical result is visible in the numbers. Yaps uses under 200MB of RAM and starts in under one second. It picks the right speech model for your hardware automatically, so you do not need to understand chip generations or model sizes to get the best performance your machine can deliver. If you are interested in what dictation looks like in practice, the difference in responsiveness between a native on-device app and a cloud-dependent one is the kind of thing you feel immediately.

On Android, the same philosophy applies. Yaps ships as a full AI keyboard, an IME that works in any app, processing all dictation on-device. Your voice does not leave your phone. The on-device speech pipeline runs through the phone's neural processing capabilities, and an on-device language model handles text cleanup, punctuation, and formatting without any cloud dependency.

This cross-platform consistency matters. Whether you are dictating on your phone during a commute or on your Mac at a desk, the experience is the same: speak, get text, nothing leaves the device. Yaps for Windows is also in active development, targeting the NPUs in Intel and AMD's latest processors.

7.0

Is Every Device Getting a Neural Engine Now?

Apple Silicon proved that dedicated neural accelerators belong in consumer devices. The rest of the industry has followed.

Qualcomm's Snapdragon 8 Gen 3 and newer mobile chips include a Hexagon NPU capable of over 45 TOPS, enough to run sophisticated speech models on an Android phone with no cloud connection. This is, in part, how Yaps delivers on-device dictation on Android today. Intel's Meteor Lake and Arrow Lake laptop processors include a dedicated NPU for the first time. AMD's Ryzen AI series integrates an XDNA NPU directly on the processor die.

01 / Origin: 2020, when Apple shipped the first mass-market laptop NPU with M1
02 / Mobile NPUs: 45+ TOPS, Snapdragon 8 Gen 3 Hexagon NPU in Android flagships
03 / Mac Today: 38 TOPS, Apple M4 Neural Engine in current Macs
04 / Adoption: 3+ major chip makers now shipping dedicated NPUs in laptops

This is not coincidence. Apple demonstrated that consumers value devices where AI runs locally, provided the experience holds up. The M1 MacBook Air was not marketed as "an AI chip for developers." It was marketed as a laptop that was fast, quiet, and lasted all day. The Neural Engine was one of the reasons it could deliver on all three promises simultaneously.

What does this convergence mean for speech recognition? Within two to three years, nearly every new laptop and smartphone sold will include a dedicated neural accelerator. On-device AI speech recognition will not be a niche feature for privacy enthusiasts. It will be the architectural default, because the hardware to support it will ship in every device.

The trade-offs that defined speech recognition for twenty years are dissolving. You do not have to choose between fast and private. You do not have to choose between accurate and offline. The silicon in your pocket and on your desk can do all of it, right now, without sending a single audio sample to the cloud.

The gap between what local hardware can do and what a cloud server can do, for the specific task of transcribing one person speaking, is now negligible. The gap in privacy is not.

The case for on-device speech, 2026

For developers building voice-driven workflows, the implication is clear: you can now build applications that process speech locally with the same confidence you would have building for a cloud API, minus the latency, the cost per request, and the compliance headaches. The hardware layer is no longer the constraint. The software layer is catching up. The ecosystem is ready.

8.0

What Comes Next for On-Device Speech?

The current generation of on-device speech models is already good enough for most single-speaker, clear-audio dictation in English. But the trajectory points toward several capabilities that were previously cloud-exclusive:

Multi-language dictation. Quantised multilingual models are shrinking fast. Running real-time transcription in Spanish, French, German, or Mandarin on-device is a near-term possibility for chips with 30+ TOPS of neural processing headroom.

Speaker diarisation. Identifying who is speaking in a multi-person recording remains computationally expensive. Cloud services still hold an advantage here. But Apple's M3 Ultra, with a 32-core Neural Engine, has the raw throughput to make on-device diarisation plausible in workstation-class hardware. Mobile chips will follow.

Longer context and streaming. Current on-device models process audio in chunks, typically a few seconds at a time. Future architectures will process longer audio streams with better contextual understanding, improving accuracy for domain-specific vocabulary and reducing the "first word" correction problem.
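Chunked processing can be sketched as computing sample ranges over the audio buffer. The overlap between chunks is an assumption added here (a common way to avoid cutting words at chunk edges), and the 3-second chunk and 0.5-second overlap are illustrative values.

```python
SAMPLE_RATE = 16_000  # Hz


def chunk_bounds(total_samples: int, chunk_samples: int, overlap_samples: int):
    """Return (start, end) sample ranges covering the audio, with overlap."""
    if not 0 <= overlap_samples < chunk_samples:
        raise ValueError("overlap must be non-negative and smaller than the chunk")
    step = chunk_samples - overlap_samples
    bounds = []
    start = 0
    while start < total_samples:
        end = min(start + chunk_samples, total_samples)
        bounds.append((start, end))
        if end == total_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return bounds


# 10 s of audio, 3 s chunks, 0.5 s overlap.
bounds = chunk_bounds(10 * SAMPLE_RATE, 3 * SAMPLE_RATE, SAMPLE_RATE // 2)
for start, end in bounds:
    print(f"{start / SAMPLE_RATE:.1f}s to {end / SAMPLE_RATE:.1f}s")
```

Each range is transcribed on its own, which is why the model sees only a few seconds of context at a time and why longer-context architectures would help with vocabulary that depends on what was said earlier.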

9.0

Final Thoughts

In November 2020, Apple shipped a chip that quietly made one of the oldest promises in computing real: your device is smart enough to understand your voice without asking anyone else for help.

The shift from cloud to local speech recognition is not a trend or a preference. It is an architectural inevitability driven by hardware that keeps getting better at the exact workloads speech models demand. Apple Silicon started it. Qualcomm, Intel, and AMD are accelerating it. Within a few years, sending your voice to a remote server for transcription will feel as unnecessary as uploading a photo to a remote server to crop it.

Yaps is built for this future, and it is available right now. On Android and macOS today, with Windows in active development, Yaps delivers on-device AI speech recognition that is fast, accurate, and completely private. Your voice never leaves your device. Start at yaps.ai.

10.0

Frequently Asked Questions

What is on-device AI speech recognition?

On-device AI speech recognition is speech-to-text processing that runs entirely on your local hardware, using a neural network model stored on the device itself. No audio is sent to a cloud server. The model takes your spoken audio, converts it to a spectrogram, runs it through neural network layers on a dedicated accelerator like the Neural Engine or an NPU, and outputs text. The entire process happens in milliseconds without an internet connection.

Why does Apple Silicon matter for local speech recognition?

Apple Silicon matters because it includes a dedicated Neural Engine designed specifically for machine learning inference. Before Apple Silicon, running a speech model locally on a laptop meant using the CPU (too slow for real-time) or a discrete GPU (too power-hungry and thermally constrained). The Neural Engine delivers trillions of operations per second at roughly one watt of power draw, making real-time, sustained, accurate speech recognition possible on a consumer laptop for the first time.

Can I run speech recognition offline on an M1 Mac?

Yes. The M1's 16-core Neural Engine and unified memory architecture provide enough performance to run modern speech recognition models entirely offline. Applications built to use Core ML can access the Neural Engine directly, delivering real-time transcription without any internet connection. Yaps for macOS is built to take full advantage of this capability, with no audio ever leaving the device.

How does unified memory help with speech recognition performance?

Unified memory removes the need to copy data between CPU memory and accelerator memory. In traditional architectures, running a speech model on a GPU required copying model weights and audio data to GPU memory before each inference pass, then copying results back afterward. This added latency to every single transcription chunk. With unified memory, the Neural Engine reads directly from the same memory pool the CPU uses, resulting in zero copy overhead and zero transfer latency.

Is on-device speech recognition as accurate as cloud-based services?

For most single-speaker, clear-audio dictation tasks, yes. The accuracy gap has narrowed dramatically since 2020. Modern quantised speech models running on Apple Silicon achieve accuracy levels comparable to major cloud services for everyday English dictation. Cloud services may still hold an edge for difficult audio conditions such as heavy background noise, thick accents, or multi-speaker scenarios, because they can run larger models on more capable hardware. But for the daily dictation use case, local models on current hardware are more than sufficient.

Does Yaps run on Apple Silicon?

Yaps runs on Apple Silicon. The macOS app is built with Tauri v2 and Rust for direct Neural Engine access via Core ML, taking advantage of unified memory and the dedicated ML accelerator on every M-series chip. Yaps also ships on Android as a full IME keyboard with on-device speech recognition, and Windows is in active development. On every platform, all dictation processing happens on-device. No audio leaves your hardware.

What about Windows laptops and Intel processors?

Intel's Meteor Lake and Arrow Lake processors include a dedicated NPU, bringing similar on-device AI acceleration to Windows laptops. AMD's Ryzen AI series does the same with its XDNA architecture. The performance of these NPUs is improving rapidly, with current generations reaching 10-45 TOPS depending on the chip. Yaps for Windows is in active development and will take advantage of available neural accelerators when it ships.

How much RAM does on-device speech recognition require?

Modern quantised speech models are surprisingly efficient. Yaps uses under 200MB of total RAM, including the speech model, the application runtime, and all supporting processes. That is a fraction of what a web browser typically consumes. Apple Silicon's unified memory architecture helps here as well, since model weights do not need to be duplicated across separate memory pools.
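The effect of quantisation on memory is easy to estimate: weight storage is parameter count times bits per weight. The sketch below runs that calculation for a hypothetical 100-million-parameter model. The parameter count is an assumption for illustration and says nothing about the model Yaps actually ships.

```python
def weights_megabytes(parameters: int, bits_per_weight: int) -> float:
    """Approximate storage for model weights, in megabytes (10^6 bytes)."""
    return parameters * bits_per_weight / 8 / 1_000_000


ASSUMED_PARAMETERS = 100_000_000  # hypothetical model size

for bits in (32, 16, 8, 4):
    size = weights_megabytes(ASSUMED_PARAMETERS, bits)
    print(f"{bits:>2}-bit weights: {size:.0f} MB")
# 32-bit weights: 400 MB
# 16-bit weights: 200 MB
#  8-bit weights: 100 MB
#  4-bit weights: 50 MB
```

Weights are only part of the footprint, since activations and the application runtime also need memory, but the calculation shows why lower-precision weights are what make a sub-200MB budget reachable.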

Will on-device speech recognition replace cloud speech recognition entirely?

For individual dictation and personal transcription, on-device processing is already the better choice for most users: lower latency, complete privacy, and no per-request costs. Cloud speech recognition will likely persist for specialised enterprise use cases like multi-language real-time translation, speaker diarisation in large meetings, and batch processing at scale. But for the core use case of "I am speaking and I want text," local processing on modern silicon is now the right default answer.

How does Apple Silicon compare to Snapdragon NPUs for speech recognition?

Both architectures now deliver enough throughput to run real-time speech recognition entirely on-device. Apple's M4 Neural Engine reaches 38 TOPS. Qualcomm's Snapdragon 8 Gen 3 Hexagon NPU reaches 45+ TOPS. In practical terms, both can run a modern quantised speech model with headroom to spare. The differences show up in how each platform exposes that hardware to developers (Core ML on Apple, Qualcomm's QNN SDK on Snapdragon) and in the surrounding system: unified memory on Apple Silicon, dedicated NPU-DDR bandwidth on Snapdragon. For end users running speech recognition, the experience is comparable on flagship hardware from either ecosystem.

Is on-device speech recognition slower than cloud-based services?

No, it is generally faster. Cloud transcription involves a network round trip, which adds 100-500 milliseconds of latency before any processing begins, plus variable wait time depending on server load. On-device recognition runs in milliseconds with zero network overhead. For real-time dictation specifically, where every word should appear as it is spoken, local processing on Apple Silicon delivers a noticeably more responsive experience than any cloud service.

What is the difference between Core ML and the Neural Engine?

Core ML is Apple's framework for running machine learning models on its hardware. The Neural Engine is the dedicated silicon accelerator inside Apple Silicon chips that Core ML can target for inference. When an app uses Core ML, Apple's framework decides which hardware to run the model on: the CPU, the GPU, or the Neural Engine. Models optimised for the Neural Engine get the fastest and most power-efficient execution. For speech recognition, Core ML is the bridge between the application code and the hardware acceleration that makes real-time on-device transcription possible.

Does Yaps work on Intel Macs?

Yaps runs on Intel Macs, but on-device speech recognition is faster and more power-efficient on Apple Silicon because Intel Macs lack a dedicated Neural Engine. On Intel hardware, the speech model runs on the CPU, which works but uses more power and produces more heat under sustained dictation. For the best on-device experience, an Apple Silicon Mac (M1 or later) is the recommended hardware. The macOS 13 Ventura minimum applies to both architectures.
