ENTRY 112 | ENGINEERING | 07 FEB 2026

How Speech Recognition Works: Components & Architecture (2026)

Q: What are the main components of a speech recognition system?

Five components: (1) audio capture and preprocessing (noise suppression, voice activity detection); (2) feature extraction (log-mel spectrograms); (3) an acoustic model — typically a transformer or conformer network — that maps audio to linguistic units; (4) a language model that resolves ambiguity using grammar and word-sequence probability; (5) a decoder and post-processor that picks the most likely transcription and adds punctuation, capitalisation, and formatting. Classical HMM systems kept these separate; modern end-to-end models like Whisper collapse stages 2–4 into a single network.

Q: What are the basics of speech recognition technology?

Speech recognition converts acoustic vibrations into text by predicting which sequence of words most likely produced the audio. The pipeline is: digitise the sound, extract features (a frequency-vs-time spectrogram), run a trained neural model on those features, and post-process the output into readable text. The technology dates to the 1950s, but the modern transformer era arrived between 2020 and 2024. The decisive shift is that ASR models are now small enough to run entirely on a phone or laptop, removing the need to send audio to the cloud at all.

Q: How does speech recognition work?

Speech recognition works by converting audio into text through a multi-stage pipeline. First, raw audio is captured and cleaned through noise suppression and voice activity detection. Then, an acoustic model (typically a transformer neural network) converts the audio signal into linguistic features by analyzing a spectrogram representation. A language model refines the output by resolving ambiguities and applying contextual understanding. Finally, post-processing adds punctuation, capitalization, and formatting. Modern systems like Yaps run this entire pipeline locally on your Mac using Apple Silicon's Neural Engine.

Q: What is OpenAI Whisper and can it run on a Mac?

OpenAI Whisper is an open-source speech recognition model trained on 680,000 hours of multilingual audio data. It uses a transformer encoder-decoder architecture and achieves near-human accuracy across dozens of languages. Yes, Whisper can run locally on a Mac through projects like whisper.cpp (a C/C++ implementation) and Argmax's WhisperKit (which runs Whisper through Core ML on Apple Silicon). Yaps uses an optimized, Whisper-derived architecture that has been fine-tuned specifically for real-time macOS dictation with 4-bit quantization for minimal memory usage.

Q: Can speech-to-text work offline?

Yes, modern speech-to-text can work fully offline with no internet connection required. On-device speech recognition tools like Yaps run the entire transcription pipeline locally on your hardware. The neural network models, language models, and post-processing all execute on your Mac's Apple Silicon chip. There is no "degraded mode" - offline accuracy is identical to online accuracy because the same models run regardless of network connectivity. This means you can dictate on airplanes, in areas without cell service, or in secure facilities with full accuracy.

Q: Is offline speech recognition as accurate as cloud-based services?

In 2026, yes - for standard dictation tasks, on-device speech recognition has effectively reached parity with cloud services. Advances in model architecture (particularly Whisper-derived models), quantization techniques like GPTQ, and dedicated ML hardware like Apple Silicon's Neural Engine have closed the gap. Yaps achieves a word error rate within 0.5% of leading cloud providers on standard English dictation. For specialized vocabulary, on-device tools with personalization can actually outperform cloud alternatives.

Q: How does Apple Silicon's Neural Engine help with speech recognition?

Apple Silicon's Neural Engine is a dedicated hardware accelerator designed specifically for machine learning inference. It can perform up to 38 trillion operations per second (on the M4 chip), providing more than enough computational power to run multiple speech recognition models simultaneously in real time. When speech models are compiled to Apple's Core ML format, Core ML can schedule them onto the Neural Engine rather than the CPU or GPU, so dictation consumes minimal system resources and has little impact on other running applications.

Q: What is model quantization in speech recognition?

Model quantization is the process of reducing the numerical precision of a neural network's parameters to decrease memory usage and increase inference speed. Yaps uses 4-bit GPTQ quantization, which reduces each model parameter from 32-bit floating-point to 4-bit integer precision - an 8x reduction in model size. A model requiring 1.6GB at full precision fits in approximately 200MB after quantization, with less than 1% accuracy loss on standard benchmarks. This is what makes it possible to run large, accurate speech models on consumer hardware like a MacBook Air.

Q: Why is privacy important for speech-to-text tools?

Speech data is uniquely sensitive - it contains both biometric identifiers (your unique voice characteristics) and the content of your communications (which may include confidential business information, medical records, legal correspondence, or personal thoughts). The track record of cloud providers with voice data includes Google's $68 million settlement for recording private conversations, lawsuits against companies like Fireflies.AI for harvesting biometric voice data without consent, and 276 million healthcare records breached in 2024. On-device processing eliminates these risks entirely by ensuring your audio never leaves your physical device.

Q: How does Yaps compare to other macOS dictation tools?

Yaps is designed as a privacy-first, offline-first voice app for macOS, Windows, and Android. Unlike cloud-dependent tools like Wispr Flow (which sends audio to OpenAI/Meta servers and captures screenshots) or Granola AI (which routes transcriptions through OpenAI/Anthropic), Yaps can process core workflows locally with zero data transmission. Unlike Electron-based alternatives, Yaps is built with Tauri v2 and Rust for minimal resource usage - under 200MB of memory with instant startup on desktop. It includes features like personalized vocabulary learning, smart dictation history, voice activity detection tuned for long-form dictation, cleanup for broken dictated text, text-to-speech, synced notes, and a studio editor for refining transcriptions.

Modern speech recognition isn't magic — it is a carefully orchestrated pipeline of acoustic processing, neural networks, and language modeling. Here is the full anatomy of an ASR system, from raw audio to text, and how transformer models displaced classical HMM approaches.

Yaps Team | 20 min read

0.0

Preface

When you hold down Fn and start speaking to Yaps, your words appear on screen almost instantly - properly capitalized, punctuated, and formatted. It feels simple. Behind the scenes, it is anything but.

Modern speech recognition is a pipeline of sophisticated technologies working in concert, each solving a piece of the puzzle that turns acoustic vibrations into meaningful text. But there is a deeper question that most technical guides overlook: where that pipeline runs matters just as much as how it works.

In this comprehensive guide, we will walk through the entire speech recognition pipeline, explain the engineering decisions that make real-time dictation feel effortless, and examine why on-device processing represents a fundamental shift in how speech-to-text should work - especially when privacy, reliability, and latency are non-negotiable.

1.0

What Are the Main Components of a Speech Recognition System?

Every modern ASR system — on your Mac, your phone, or a cloud server — is built from the same five components:

1. Audio capture and preprocessing: microphone input, noise suppression, voice activity detection.
2. Feature extraction: converting waveforms into compact representations the model can consume (typically log-mel spectrograms).
3. Acoustic model: a neural network that maps audio features to linguistic units (phonemes, characters, or subword tokens).
4. Language model: grammar, context, and word-sequence probability; turns plausible-sounding fragments into actual sentences.
5. Decoder and post-processing: beam search picks the most likely transcription, then formatting adds punctuation, capitalisation, and number normalisation.

Classical systems wired these together with hand-tuned rules. Modern end-to-end models collapse stages 2–4 into a single neural network trained on raw audio and text. Stages 1 and 5 still earn their keep as pre- and post-processing scaffolding.

2.0

How Does Speech-to-Text Work?

At a high level, the five components above collapse into four practical stages most engineers think about when designing or debugging an ASR pipeline:

1. Audio capture and preprocessing - cleaning raw microphone input
2. Acoustic modeling - converting sound into linguistic features using neural networks
3. Language modeling - understanding context, grammar, and meaning
4. Post-processing - formatting, punctuation, capitalization, and correction

Each stage presents distinct engineering challenges. The choices you make at every level - what models to use, where to run them, how to optimize them - determine whether the result feels magical or frustrating. Let's dive into each stage in detail.

3.0

Stage 1: How Does Audio Preprocessing Work in Speech Recognition?

Before any AI touches your speech, the raw audio needs to be cleaned up. Your microphone captures everything - your voice, keyboard clicks, the hum of your fan, that construction outside your window. The first job is isolating what matters.

What Is Spectral Masking for Noise Suppression?

Yaps uses a real-time noise suppression model that runs locally on your device. This neural network has been trained on thousands of hours of noisy audio to distinguish speech from background noise. It operates on 20-millisecond audio frames, meaning it processes and cleans your audio 50 times per second.

The key challenge is doing this without introducing artifacts - the hollow, underwater quality you hear on bad conference calls. Our model uses a technique called spectral masking, where it learns which frequency bands contain speech and which contain noise, then selectively attenuates the noise while preserving the natural quality of your voice.

Spectral masking works by analyzing the frequency spectrum of each audio frame and generating a mask - a set of multipliers between 0 and 1 for each frequency bin. Frequency bins dominated by noise get multiplied by values close to zero, effectively suppressing them. Bins dominated by speech pass through largely untouched. The result is clean, natural-sounding audio that retains the speaker's vocal characteristics without the robotic quality of simpler noise-gating approaches.
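The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Yaps's model: the mask here comes from a classical Wiener-style rule rather than a trained neural network, and the 16 kHz sample rate, the synthetic 500 Hz "speech" tone, and the function name `spectral_mask` are all assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # assumed sample rate in Hz
FRAME_LEN = SAMPLE_RATE * 20 // 1000      # one 20 ms frame -> 320 samples


def spectral_mask(spectrum: np.ndarray, noise_power: np.ndarray) -> np.ndarray:
    """Return one multiplier in [0, 1] per frequency bin (Wiener-style gain)."""
    power = np.abs(spectrum) ** 2
    speech_power = np.maximum(power - noise_power, 0.0)
    return speech_power / (speech_power + noise_power + 1e-12)


rng = np.random.default_rng(0)
window = np.hanning(FRAME_LEN)
t = np.arange(FRAME_LEN) / SAMPLE_RATE

# Estimate the average noise power per bin from noise-only frames.
noise_frames = 0.05 * rng.standard_normal((200, FRAME_LEN))
noise_power = np.mean(
    np.abs(np.fft.rfft(noise_frames * window, axis=1)) ** 2, axis=0
)

# A 500 Hz tone stands in for speech; white noise stands in for background noise.
noisy = np.sin(2 * np.pi * 500 * t) + 0.05 * rng.standard_normal(FRAME_LEN)
spectrum = np.fft.rfft(noisy * window)

mask = spectral_mask(spectrum, noise_power)             # values near 1 keep a bin
cleaned = np.fft.irfft(spectrum * mask, n=FRAME_LEN)    # back to the time domain
```

With a 320-sample frame at 16 kHz, each bin spans 50 Hz, so the tone lands in bin 10 and its mask value stays close to 1, while most noise-only bins are pushed toward 0. A learned model replaces the fixed rule with a network that predicts the mask directly from the noisy spectrum.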

Because this processing happens entirely on your Mac, there is zero network latency added to the pipeline. Cloud-based alternatives must upload raw audio, process it on remote servers, and return the cleaned signal - adding anywhere from 50 to 300 milliseconds of delay depending on connection quality.

What Is Voice Activity Detection and Why Does It Matter?

Not everything you say should be transcribed. Coughs, throat-clearing, background conversations - the system needs to know when you are actually dictating. Voice Activity Detection (VAD) is a lightweight classifier that runs continuously, flagging audio frames that contain intentional speech.

Our VAD model is particularly tuned for dictation patterns. It understands that a pause of two seconds mid-sentence is thinking time, not the end of an utterance. This prevents the fragmentation you see in many speech-to-text tools, where pauses cause the system to submit partial, broken transcriptions.
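One way to picture the pause handling is as a grouping step that runs after the per-frame classifier. The sketch below assumes the classifier has already produced one speech/non-speech flag per 20 ms frame; the 2.5-second `MAX_PAUSE_MS` threshold and the function name `segment_utterances` are hypothetical values chosen for illustration, not Yaps's actual settings.

```python
from typing import List, Optional, Tuple

FRAME_MS = 20          # frame length used by the preprocessing stage
MAX_PAUSE_MS = 2500    # assumed: shorter pauses stay inside one utterance


def segment_utterances(is_speech: List[bool]) -> List[Tuple[int, int]]:
    """Group per-frame speech flags into (first_frame, last_frame) utterances."""
    max_pause_frames = MAX_PAUSE_MS // FRAME_MS
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last_speech = 0
    for i, flag in enumerate(is_speech):
        if flag:
            if start is None:
                start = i            # a new utterance begins
            last_speech = i
        elif start is not None and i - last_speech > max_pause_frames:
            segments.append((start, last_speech))   # pause too long: close it
            start = None
    if start is not None:
        segments.append((start, last_speech))
    return segments


# 1 s speech, 2 s thinking pause, 1 s speech, 4 s silence, 1 s speech
flags = [True] * 50 + [False] * 100 + [True] * 50 + [False] * 200 + [True] * 50
utterances = segment_utterances(flags)
```

Here the two-second pause is absorbed into the first utterance, while the four-second silence splits off a second one. A conversational VAD would typically use a much shorter threshold, which is what produces the fragmentation described above.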

This is a subtle but critical differentiator. Many cloud-based transcription services use generic VAD models optimized for conversation (short turn-taking exchanges), not for dictation (long-form, thoughtful monologues with natural pauses). The result is that cloud tools often fragment your speech into disconnected chunks, losing the thread of your thought. Yaps treats dictation as its own interaction paradigm and tunes accordingly.

4.0

Stage 2: How Do Neural Networks Convert Speech to Text?

This is where the heavy lifting happens. The acoustic model takes preprocessed audio and produces a sequence of probable linguistic units - phonemes, word pieces, or characters, depending on the architecture.

How Are Audio Features Extracted for Speech Recognition?

Raw audio waveforms contain far more information than a speech model needs. The first step is converting the waveform into a more compact representation. Yaps uses log-mel spectrograms - a representation that mimics how the human ear perceives sound.

A spectrogram breaks audio into frequency bands over time. The "mel" scale warps these frequencies to match human perception (we are more sensitive to differences in low frequencies than high). The "log" transformation compresses the dynamic range, similar to how our ears perceive loudness logarithmically.

The result is a 2D image-like representation of your speech, where the x-axis is time, the y-axis is frequency, and the intensity represents energy. This is what the neural network actually processes - and it is remarkably efficient. A full minute of high-fidelity audio, which might be 10 megabytes as a raw waveform, compresses down to roughly 200 kilobytes as a mel spectrogram while retaining all the information needed for accurate transcription.
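The transformation can be sketched with NumPy alone. The parameters below (16 kHz audio, 400-sample windows, 160-sample hop, 80 mel bands) match those published for Whisper but are assumptions as far as Yaps is concerned, and the helper names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed input sample rate
N_FFT = 400            # 25 ms analysis window
HOP = 160              # 10 ms step between frames
N_MELS = 80            # number of mel bands


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank() -> np.ndarray:
    """Triangular filters spaced evenly on the mel scale, shape (N_MELS, bins)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(SAMPLE_RATE / 2), N_MELS + 2))
    fft_freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SAMPLE_RATE)
    bank = np.zeros((N_MELS, len(fft_freqs)))
    for m in range(N_MELS):
        lo, centre, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Return an array of shape (N_MELS, n_frames): frequency by time."""
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(N_FFT)              # windowed frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # power spectrum
    mel = power @ mel_filterbank().T                     # warp to the mel scale
    return np.log10(np.maximum(mel, 1e-10)).T            # compress dynamic range


t = np.arange(SAMPLE_RATE) / SAMPLE_RATE                 # one second of audio
low_tone = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
high_tone = log_mel_spectrogram(np.sin(2 * np.pi * 4000 * t))
```

One second of audio becomes an 80-by-98 array here, and the energy of the 440 Hz tone concentrates in a lower mel band than that of the 4 kHz tone, which is the "image" the acoustic model consumes.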

HMM vs Transformer: How Modern Speech Models Replaced the Old Stack

For thirty years, speech recognition ran on Hidden Markov Models (HMMs) layered over hand-crafted phoneme dictionaries and n-gram language models. Dragon NaturallySpeaking, Siri's early years, and most enterprise dictation from the 1990s through the late 2010s used variants of this stack. It worked, but it demanded heavy engineering, per-language pronunciation dictionaries, and manual feature work.

The end-to-end era - kicked off by the attention-based Listen, Attend and Spell model (2015), then carried by transformer-family architectures such as Conformer (2020) and Whisper (2022), along with the RNN-Transducer family Parakeet uses - collapsed the stack into a single neural network trained end-to-end on raw audio and text.

Dimension	Classical (HMM + GMM/DNN)	Modern (Transformer / Conformer)
Pipeline	Separate acoustic + pronunciation + language models	Single end-to-end network
Training data	Tens of hours, heavily annotated	Hundreds of thousands of hours, weakly labelled
Accent robustness	Brittle; per-accent tuning needed	Strong out of the box
Pronunciation dictionary	Required per language	Learned implicitly
Latency	Low, predictable	Higher, mitigated by streaming + speculative decoding
Edge deployment	Small footprint	Larger but quantisable to fit Apple Silicon
Examples	Kaldi, Dragon (legacy), Sphinx, SAPI	Whisper, Parakeet, Conformer-RNN-T, Apple on-device ASR

The classical stack still has a place in narrow domains — voice dialling, command-and-control on weak hardware, certain telephony pipelines. For open-vocabulary dictation in 2026, transformers have effectively won. The remaining question is where you run it — which is where the cloud-versus-on-device debate lives.

What Is OpenAI Whisper and How Does It Work?

The landscape of speech recognition changed dramatically with the release of OpenAI Whisper, an open-source speech recognition model trained on 680,000 hours of multilingual audio. Whisper demonstrated that a single, large transformer model could achieve near-human accuracy across dozens of languages without the complex, multi-component pipelines that dominated earlier approaches.

Whisper uses an encoder-decoder transformer architecture - built from the same attention-based blocks as large language models like GPT, but adapted for audio input:

The encoder processes the mel spectrogram through multiple layers of self-attention, building increasingly abstract representations of the audio. Early layers capture low-level acoustic features (vowel formants, consonant bursts), while deeper layers capture higher-level patterns (syllable structure, speaking rhythm, accent characteristics).

The decoder generates text tokens auto-regressively - one at a time, each conditioned on the audio encoding and all previously generated tokens. This is what gives the model its ability to handle ambiguity. When it encounters a sound that could be "their," "there," or "they're," the decoder uses context from the rest of the utterance to choose correctly.
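The auto-regressive loop itself is simple to sketch. In the example below, `toy_scores` is a hypothetical stand-in for the real decoder network: it returns hand-written scores so the loop can run, and it is written so that the audio alone cannot separate "there" from "their". Greedy selection is used for brevity; real systems typically use beam search.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"
ScoreFn = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_decode(audio_encoding: Sequence[float], score_fn: ScoreFn,
                  max_tokens: int = 20) -> List[str]:
    """Emit one token at a time, each conditioned on the audio and the prefix."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        scores = score_fn(audio_encoding, tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def toy_scores(audio_encoding: Sequence[float], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder: hand-written scores, not a trained model."""
    script = ["put", "it", "over"]
    if len(prefix) < len(script):
        return {script[len(prefix)]: 1.0}
    if len(prefix) == len(script):
        # The sound is ambiguous; the preceding word decides the spelling.
        if prefix[-1] == "over":
            return {"there": 0.6, "their": 0.4}
        return {"their": 0.6, "there": 0.4}
    return {EOS: 1.0}


transcript = greedy_decode([0.0], toy_scores)
```

The point of the sketch is the data flow: every call to the scoring function sees both the audio encoding and everything generated so far, which is what lets context resolve homophones.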

What made Whisper transformative was not just its architecture but its training data. By training on hundreds of thousands of hours of diverse audio - different accents, recording conditions, background noise levels, and speaking styles - the model developed remarkable robustness that previous systems lacked.

Can Whisper AI Run Locally on a Mac?

Yes. This is one of the most significant developments in speech recognition over the past two years. Projects like whisper.cpp (a C/C++ port of Whisper by Georgi Gerganov) and Argmax's WhisperKit have made it possible to run Whisper-class models entirely on-device, with no internet connection and no data leaving your machine.

Yaps builds on this foundation. Our acoustic model is derived from the Whisper architecture but has been extensively optimized for real-time, on-device dictation on macOS. The key optimizations include:

Core ML compilation for Apple's Neural Engine (more on this below)
4-bit quantization to reduce memory footprint by 8x
Streaming inference for real-time text output while you are still speaking
Dictation-specific fine-tuning for English prose, technical vocabulary, and natural speech patterns

The result is a model that runs locally with accuracy matching cloud services from major providers - something that would have been impossible as recently as 2023.

How Does Streaming Speech Recognition Work?

Many speech recognition systems wait until you stop speaking to process the entire utterance at once. This creates a noticeable delay between speaking and seeing text. Yaps uses streaming recognition: the model begins producing text while you are still speaking.

This is technically challenging because early in an utterance, the model has limited context. It might initially transcribe "I want to book a" as the beginning of a hotel reservation before hearing "flight to Tokyo." Our model handles this with speculative decoding - it produces a best guess in real-time but maintains the ability to revise earlier tokens as more audio arrives. On screen, you see text appearing smoothly, with occasional subtle corrections as the model refines its understanding.

Speculative decoding works by running two passes simultaneously: a fast, lightweight pass that generates initial predictions with low latency, and a more thorough pass that verifies and corrects those predictions as more context becomes available. The user sees the fast pass first, with corrections applied so smoothly that the process feels like continuous, fluid transcription rather than a sequence of corrections.
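How the two passes meet on screen can be sketched as a small merge step. This is an illustration under assumptions, not Yaps's implementation: it assumes the thorough pass lags behind the fast pass and returns verified tokens for the start of the utterance, and the function name `apply_revision` is hypothetical.

```python
from typing import List, Tuple


def apply_revision(displayed: List[str], verified: List[str]) -> Tuple[List[str], int]:
    """Overwrite the start of the displayed text with verified tokens.

    Returns the updated display and the index of the first token that changed
    (equal to the length of the compared prefix if nothing changed).
    """
    first_changed = min(len(displayed), len(verified))
    for i, (shown, checked) in enumerate(zip(displayed, verified)):
        if shown != checked:
            first_changed = i
            break
    # Verified tokens replace the prefix; unverified fast-pass tokens remain.
    updated = verified + displayed[len(verified):]
    return updated, first_changed


fast_pass = ["I", "want", "two", "book", "a"]     # low-latency guess on screen
slow_pass = ["I", "want", "to", "book"]           # verified with more context
display, changed_at = apply_revision(fast_pass, slow_pass)
```

Only the tokens from `changed_at` onward need to be redrawn, which is why a correction can look like a small in-place edit rather than the whole line being replaced.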

Key Takeaway

Streaming recognition is what separates dictation tools that feel instant from those that feel sluggish. The dual-pass approach - fast speculative output followed by silent correction - is closely related to the speculative decoding used to speed up token generation in modern LLMs, where a small draft model proposes tokens and a larger model verifies them. Applied to speech, it means you never wait for the model to "catch up" to your voice.

5.0

Stage 3: How Does Language Modeling Improve Speech Recognition Accuracy?

The acoustic model gives us a rough transcription. The language model refines it, using its knowledge of grammar, vocabulary, and common phrases to correct errors and resolve ambiguity.

How Does Contextual Understanding Work in Speech-to-Text?

Consider the phrase "recognize speech." Acoustically, it is almost identical to "wreck a nice beach." Without understanding context, a purely acoustic model would struggle to distinguish them. The language model knows that "recognize speech" is a coherent English phrase while "wreck a nice beach" is grammatically unusual, and weights its output accordingly.
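A common way to combine the two sources of evidence is to add a weighted language-model score to each acoustic candidate. The sketch below uses a tiny hand-written bigram table; every probability, the `LM_WEIGHT` value, and the function names are made up for illustration.

```python
import math
from typing import List, Tuple

LM_WEIGHT = 0.8              # assumed weight on the language-model score
UNSEEN = math.log(1e-4)      # fallback log-probability for unknown word pairs

# Hypothetical bigram log-probabilities, not measured from any corpus.
BIGRAM_LOGPROB = {
    ("recognize", "speech"): math.log(0.2),
    ("wreck", "a"): math.log(0.01),
    ("a", "nice"): math.log(0.05),
    ("nice", "beach"): math.log(0.02),
}


def lm_logprob(text: str) -> float:
    """Sum bigram log-probabilities over adjacent word pairs."""
    words = text.split()
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(words, words[1:]))


def rescore(candidates: List[Tuple[str, float]]) -> str:
    """Pick the candidate with the best combined acoustic + LM score."""
    return max(candidates, key=lambda c: c[1] + LM_WEIGHT * lm_logprob(c[0]))[0]


# (text, acoustic log-probability): the audio slightly prefers the wrong phrase.
candidates = [
    ("wreck a nice beach", math.log(0.52)),
    ("recognize speech", math.log(0.48)),
]
acoustic_only = max(candidates, key=lambda c: c[1])[0]
with_language_model = rescore(candidates)
```

On acoustic evidence alone the first candidate wins by a small margin; once the language-model term is added, "recognize speech" comes out ahead.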

Yaps uses a custom language model that has been fine-tuned for dictation-style speech. This means it understands patterns like:

Run-on sentences that are common in spoken English
Self-corrections ("no, wait, I meant...")
Dictated punctuation commands ("period," "new paragraph," "comma")
Technical vocabulary across common professional domains (legal, medical, engineering, finance)
Code-switching between formal and informal registers within the same dictation session

How Does Personalized Speech Recognition Work?

Over time, Yaps learns your vocabulary. If you frequently use specialized terms - legal jargon, medical terminology, company-specific acronyms - the language model adapts. This happens entirely on-device: your personal language model is stored locally and never uploaded.

The technical mechanism is a small, personalized n-gram model that sits alongside the main language model. When you use a word the main model does not recognize well, the personalized model boosts its probability in future transcriptions. It is a simple technique, but it makes a dramatic difference for specialized vocabularies.

For example, a radiologist who regularly dictates reports with terms like "pneumomediastinum" or "hepatosplenomegaly" will find that Yaps quickly learns these terms and transcribes them accurately without manual correction. This adaptation is stored as a small local file - typically under 5 megabytes - that captures your unique vocabulary patterns without storing any of your actual dictated content.
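The boosting idea can be sketched with nothing more than a word counter. This is a simplified, unigram version of the mechanism described above; the class name `PersonalVocabulary`, the logarithmic boost, and the candidate scores are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict


class PersonalVocabulary:
    """Stores only words and counts, never the dictated text itself."""

    def __init__(self, boost_scale: float = 1.0) -> None:
        self.counts: Counter = Counter()
        self.boost_scale = boost_scale

    def observe(self, word: str) -> None:
        """Record that the user confirmed or corrected to this word."""
        self.counts[word.lower()] += 1

    def boost(self, word: str) -> float:
        """Log-domain bonus that grows slowly with how often a word is used."""
        return self.boost_scale * math.log1p(self.counts[word.lower()])


def pick_word(candidates: Dict[str, float], vocab: PersonalVocabulary) -> str:
    """Choose the candidate with the best base score plus personal boost."""
    return max(candidates, key=lambda w: candidates[w] + vocab.boost(w))


vocab = PersonalVocabulary()
# Hypothetical log-scores from the main model for one stretch of audio.
candidates = {"pneumomediastinum": -6.0, "new mow media sternum": -4.5}

before = pick_word(candidates, vocab)
for _ in range(5):
    vocab.observe("pneumomediastinum")
after = pick_word(candidates, vocab)
```

Before any observations the more common-sounding phrase wins; after the term has been seen five times, the boost of ln(6), roughly 1.8, is enough to tip the decision.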

6.0

Stage 4: How Does Post-Processing Create Polished Text from Speech?

Raw transcription - even good raw transcription - looks nothing like polished text. Post-processing transforms stream-of-consciousness speech into properly formatted written language.

How Does Automatic Punctuation Work?

Yaps does not require you to say "period" or "comma." Our punctuation model analyzes the transcript and inserts punctuation based on prosodic cues (pauses, intonation) and syntactic structure. It handles:

Periods and commas
Question marks (detected from rising intonation and interrogative syntax)
Exclamation points (detected from emphasis and context)
Semicolons and colons (from specific grammatical patterns)
Quotation marks (when you are clearly quoting someone)
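A rule-based sketch shows how pause length and sentence-initial words can drive punctuation. A trained punctuation model learns these cues rather than hard-coding them; the thresholds, the `QUESTION_STARTERS` set, and the word timings below are all assumptions for illustration.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]   # (text, start_seconds, end_seconds)

QUESTION_STARTERS = {"who", "what", "when", "where", "why", "how",
                     "is", "are", "do", "does", "can", "could", "will", "would"}


def punctuate(words: List[Word], comma_pause: float = 0.3,
              stop_pause: float = 0.7) -> str:
    """Insert commas and sentence-final marks based on pauses between words."""
    out: List[str] = []
    sentence_start = 0
    for i, (text, _start, end) in enumerate(words):
        is_last = i == len(words) - 1
        gap = float("inf") if is_last else words[i + 1][1] - end
        if i == sentence_start:
            text = text[:1].upper() + text[1:]      # sentence-initial capital
        if gap >= stop_pause:
            first = words[sentence_start][0].lower()
            text += "?" if first in QUESTION_STARTERS else "."
            sentence_start = i + 1
        elif gap >= comma_pause:
            text += ","
        out.append(text)
    return " ".join(out)


timed_words = [
    ("can", 0.0, 0.2), ("you", 0.25, 0.4), ("hear", 0.45, 0.7), ("me", 0.75, 0.9),
    ("yes", 1.8, 2.0), ("loud", 2.4, 2.6), ("and", 2.65, 2.75), ("clear", 2.8, 3.1),
]
punctuated = punctuate(timed_words)
```

The long pause after "me" ends a sentence that began with an interrogative word, so it receives a question mark; the shorter pause after "yes" becomes a comma.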

How Does Automatic Capitalization and Formatting Work?

Sentence-initial capitalization is straightforward, but proper noun detection is more nuanced. The model uses a named entity recognizer to identify people, places, organizations, and products, capitalizing them correctly. It also handles common patterns like formatting numbers ("three hundred" becomes "300" or "three hundred" depending on context) and common abbreviations.

What Is Disfluency Removal in Speech Recognition?

Spoken language is full of disfluencies - "um," "uh," false starts, and repeated words. Yaps automatically removes these, producing clean text that reads as if it were typed. This is one of the most noticeable differences between Yaps and basic transcription tools, which faithfully reproduce every filler word.

Disfluency removal uses a sequence-labeling model that classifies each token as either "keep" or "remove." It is trained on parallel corpora of spoken and written language, so it understands not just which words to remove but how to restructure the remaining text so it flows naturally.
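The keep/remove labelling can be illustrated with simple rules standing in for the trained model. This sketch only handles filler words and immediate repetitions, so it would also strip legitimate repeats such as "had had"; the `FILLERS` set and function names are assumptions.

```python
from typing import List

FILLERS = {"um", "uh", "er", "erm"}


def label_tokens(tokens: List[str]) -> List[str]:
    """Assign "keep" or "remove" to every token."""
    labels = ["remove" if t.lower() in FILLERS else "keep" for t in tokens]
    kept = [i for i, label in enumerate(labels) if label == "keep"]
    # A word repeated right after itself (ignoring fillers) is a false start.
    for current, following in zip(kept, kept[1:]):
        if tokens[current].lower() == tokens[following].lower():
            labels[current] = "remove"
    return labels


def remove_disfluencies(text: str) -> str:
    tokens = text.split()
    labels = label_tokens(tokens)
    return " ".join(t for t, label in zip(tokens, labels) if label == "keep")


cleaned_text = remove_disfluencies("I um I think we we should uh ship it")
```

A learned sequence labeller produces the same kind of per-token output but decides from context, which is what allows it to keep intentional repetition and handle longer false starts.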

7.0

Cloud vs. On-Device Speech Recognition: What Is the Difference?

The technical pipeline described above can run in two fundamentally different environments: on remote cloud servers or locally on your own device. This is not just an architectural choice - it has profound implications for privacy, latency, reliability, and who ultimately controls your data. For a practical look at how these architectural differences play out across specific products, see our comparison of the best dictation apps for Mac.

How Does Cloud-Based Speech Recognition Work?

Most popular speech-to-text tools - including Wispr Flow, Granola AI, Otter.ai, and many others - route your audio through cloud servers. The process typically works like this:

1. Your microphone captures audio on your device.
2. The raw audio (or lightly compressed audio) is uploaded to remote servers operated by companies like OpenAI, Google, Amazon, or Meta.
3. The speech recognition pipeline runs on powerful cloud GPUs.
4. The transcribed text is sent back to your device.
5. Your audio data may be retained on those servers for varying periods.

This approach has one advantage: cloud providers can run larger models on more powerful hardware. But this advantage has been shrinking rapidly as on-device capabilities have improved.

What Are the Latency Differences Between Cloud and Local Speech Recognition?

Cloud-based speech recognition adds network round-trip time to every transcription. On a fast connection, this might be 50-100 milliseconds. On a slow connection, hotel WiFi, or a mobile hotspot, it can balloon to 300-500 milliseconds or more. And those numbers assume the servers are not under heavy load.

On-device processing eliminates this entirely. Yaps processes audio with a consistent latency of under 20 milliseconds regardless of network conditions, because the neural network is running directly on your Mac's Apple Silicon Neural Engine. The result is text that appears to flow from your speech in real time, with no perceptible delay.

Cloud Processing

50-500ms latency depending on connection quality. Fails completely offline. Audio uploaded to third-party servers. Accuracy varies under server load. Your voice biometrics stored remotely.

On-Device Processing

Under 20ms latency, every time. Works fully offline, anywhere. Audio never leaves your Mac. Consistent performance regardless of load. Zero biometric data exposure.

Can Speech Recognition Work Offline?

This is one of the most important practical questions for anyone who relies on voice-to-text in their daily workflow. The answer depends entirely on architecture.

Cloud-dependent tools cannot work offline. If you are on an airplane, in a rural area without cell service, working in a secure facility that restricts internet access, or simply experiencing an ISP outage, cloud-based speech recognition stops completely. Your workflow grinds to a halt.

On-device tools work everywhere. Yaps runs its entire offline dictation pipeline locally on your Mac. There is no internet dependency whatsoever. You can dictate on a transatlantic flight, in a remote cabin, or in a government SCIF with the same accuracy and speed as you would at your desk with gigabit fiber. The models, the inference engine, the language model, the post-processor - everything runs on your hardware.

This is not a degraded "offline mode" with reduced accuracy. The same models run whether you are connected or not. The experience is identical.

This reliability matters especially for users who depend on voice input as a primary input method - not by preference, but by necessity. For professionals managing RSI, carpal tunnel syndrome, or other repetitive strain conditions, a voice tool that stops working when WiFi drops is not just inconvenient, it is a workflow failure. Our guide on using voice input for RSI and repetitive strain injuries explores how on-device reliability factors into accessibility-first voice workflows.

8.0

Why Does Privacy Matter in Speech Recognition?

Speech is among the most sensitive data a person can produce. Your voice carries biometric identifiers unique to you. The content of your speech - emails, medical notes, legal briefs, personal journals, business strategies - is often confidential or privileged. When this data is sent to cloud servers, you are trusting third parties with some of the most private information in your digital life. We explore the full range of what voice data reveals - from emotional state to health indicators - in our article on why your voice data is more sensitive than you think.

What Are the Privacy Risks of Cloud Speech Recognition?

The track record of cloud providers with voice data should give anyone pause:

Data breach exposure is massive and growing. In 2024 alone, 276 million healthcare records were breached in the United States. Speech data processed in the cloud - including medical dictations, therapy session notes, and patient records - is part of this attack surface.

Companies have been caught recording and retaining private conversations. Google paid a $68 million settlement for recording private conversations through its voice assistant without adequate user consent. Amazon, Apple, and other tech companies have all faced scrutiny for having human contractors review voice recordings from their assistants.

Biometric voice data is being harvested without consent. Fireflies.AI, a popular meeting transcription tool, was sued for harvesting biometric voice data - the unique vocal characteristics that identify you as an individual - without user consent. Voice biometrics, once captured, cannot be changed like a password.

Cloud transcription tools route data through third-party AI providers. Wispr Flow, a popular macOS dictation tool, sends all audio to OpenAI and Meta servers for processing and captures screenshots of your screen for context. Granola AI routes meeting transcriptions through OpenAI and Anthropic. When you use these tools, your data passes through not just the tool provider but also their upstream AI vendors - multiplying the number of parties with access to your private speech.

Healthcare tools built on cloud infrastructure are particularly concerning. Heidi Health, an AI medical scribe used in clinical settings, is 100% cloud-dependent on Google Cloud infrastructure. It has faced reports of hallucination issues, including fabricating patient information in clinical notes. When the stakes are a patient's medical record, cloud dependency introduces both privacy and accuracy risks.

User concern is widespread. According to industry research, 40% of voice assistant users worry about "who is listening" to their voice data. This is not paranoia - it is a rational response to a documented history of misuse.

How Does On-Device Processing Protect Your Privacy?

Yaps takes a fundamentally different approach: 100% on-device processing with zero data transmission. Here is what that means in practice:

Your audio never leaves your Mac. It is captured by the microphone, processed by neural networks running on your local hardware, and discarded. No audio is ever uploaded, transmitted, or stored on remote servers.
Your transcriptions stay on your device. The text output of speech recognition is stored locally and is under your control. It is not used to train AI models, not shared with third-party providers, and not accessible to anyone but you.
Your voice biometrics are never captured. Because audio processing happens entirely on-device, your unique vocal characteristics are never transmitted to or stored on any server.
There is no "cloud fallback." Some tools advertise on-device processing but quietly fall back to cloud servers for complex audio or when local processing is slow. Yaps never does this. The architecture is local-only by design, not by configuration.

This is not just a feature - it is an architectural guarantee. There is no server to breach, no API to intercept, no third party to subpoena. Your speech data exists only on your Mac, processed by hardware you physically control.

9.0

How Does Apple Silicon Accelerate Speech Recognition?

Running sophisticated neural networks locally would have been impractical on consumer hardware even five years ago. What changed was Apple Silicon - specifically, the Neural Engine.

What Is Apple Silicon's Neural Engine?

Every Apple Silicon chip (M1 through M4 and beyond) includes a dedicated Neural Engine - a hardware accelerator specifically designed for machine learning inference. The Neural Engine is a distinct processing unit, separate from the CPU and GPU, optimized for the matrix multiplication and tensor operations that neural networks rely on.

The M-series Neural Engine can perform up to 38 trillion operations per second (TOPS) on the M4 chip. To put that in perspective, that is more than enough computational power to run multiple speech recognition models simultaneously in real time while leaving the CPU and GPU entirely free for your other applications.

How Does Core ML Optimize Speech Models for Mac?

Apple's Core ML framework is the bridge between trained machine learning models and Apple Silicon hardware. When Yaps compiles its speech recognition models to Core ML format, several optimizations happen automatically:

Neural Engine scheduling: Core ML dispatches the operations the Neural Engine supports to it, falling back to the GPU or CPU only for operations it cannot run, which keeps the impact on other workloads low
Memory mapping: Model weights are memory-mapped from disk, so they load instantly and share memory with the system efficiently
Hardware-specific optimization: Core ML generates code paths optimized for the specific Neural Engine variant in each chip generation
Operator fusion: Adjacent small operations are fused into larger, more efficient ones that better utilize the Neural Engine's parallel processing units
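
The memory-mapping point can be illustrated outside Core ML. The following is a minimal Python sketch, not Yaps or Core ML code: it writes a small stand-in weight file, then maps it read-only so a value is paged in when it is accessed instead of the whole file being copied into memory first. The file name `weights.bin` and the helpers `write_weights` and `read_weight` are hypothetical names chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

def write_weights(path, values):
    """Store float32 values as raw little-endian bytes (a stand-in for a weight file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))

def read_weight(path, index):
    """Read one float32 through a read-only memory map, without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the page holding this offset needs to be resident in memory.
            (value,) = struct.unpack_from("<f", mapped, index * 4)
            return value

# Hypothetical weight file with four parameters.
weights_path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(weights_path, [0.5, -1.25, 2.0, 3.75])
third_weight = read_weight(weights_path, 2)
```

Because the mapping is backed by the file itself, the operating system can evict and reload pages as needed, which is why mapped weights add little to startup time.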

The result is that Yaps can run its entire speech recognition pipeline - noise suppression, acoustic modeling, language modeling, and post-processing - using under 200MB of memory and negligible CPU usage. You can dictate while running heavy applications like Xcode, Final Cut Pro, or large language models without any performance impact on either your dictation or your other work.

What Is Model Quantization and How Does It Help On-Device Speech Recognition?

Our acoustic model was originally trained at 32-bit floating-point precision - the standard for training deep neural networks. For on-device inference, we quantize it to 4-bit precision using a technique called GPTQ (Generative Pre-Trained Transformer Quantization).

Quantization replaces high-precision floating-point numbers with lower-precision integers. At 4-bit precision, each model parameter occupies one-eighth the memory of its 32-bit counterpart. A model that would require 1.6 gigabytes at full precision fits in approximately 200 megabytes after quantization.
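
The arithmetic behind those figures can be checked in a few lines. This is a back-of-the-envelope sketch that counts weight storage only; the parameter count is derived from the 1.6 GB figure above, not a published number, and `model_size_bytes` is a hypothetical helper.

```python
def model_size_bytes(num_params, bits_per_param):
    """Storage for the weights alone, ignoring scales, zero-points, and activations."""
    return num_params * bits_per_param // 8

# 1.6 GB at 32 bits (4 bytes) per parameter implies about 400 million parameters.
num_params = 1_600_000_000 * 8 // 32

fp32_bytes = model_size_bytes(num_params, 32)
int4_bytes = model_size_bytes(num_params, 4)
reduction = fp32_bytes / int4_bytes
```

In practice, 4-bit schemes also store a scale (and sometimes a zero-point) per group of weights, so the real file is somewhat larger than the pure 8x figure suggests.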

The key insight is that neural networks are remarkably tolerant of reduced precision during inference (as opposed to training). The accuracy loss from 4-bit quantization is typically less than 1% on standard speech recognition benchmarks - imperceptible in real-world use. This is what makes it possible to run Whisper-class models on a MacBook Air with 8GB of RAM without breaking a sweat.

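GPTQ achieves this by working through the model one layer at a time. Using a small calibration set, it estimates how sensitive each layer's output is to changes in each weight (approximate second-order information), quantizes the weights in sequence, and adjusts the not-yet-quantized weights to compensate for the error introduced so far. Unlike naive round-to-nearest quantization, which can degrade specific model capabilities, GPTQ keeps each layer's output close to the original while compressing the weights to 4 bits.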
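
For contrast, here is a minimal sketch of the naive baseline that GPTQ improves on: symmetric round-to-nearest quantization of one group of weights to 4 bits. It is not GPTQ and not Yaps's implementation; the function names and sample weights are hypothetical, and the grouping and symmetric range are assumptions made for illustration.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [-(2**(bits-1)), 2**(bits-1) - 1] and the shared scale.
    """
    qmax = 2 ** (bits - 1) - 1   # largest code, 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # smallest code, -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Hypothetical weights from a single group.
weights = [0.7, -0.35, 0.1, 0.0, -0.7, 0.52]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight here is rounded independently, so the per-weight error is bounded by half the scale but nothing accounts for how the errors combine in the layer's output. GPTQ's contribution is to choose the rounding with that combined output error in mind.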

<20ms: Processing latency
8x: Model size reduction via 4-bit quantization
38 TOPS: Neural Engine peak (M4)
<200MB: Total memory footprint

10.0

Is On-Device Speech Recognition as Accurate as Cloud?

This is the question that matters most, and the answer has changed dramatically in recent years.

How Has On-Device Accuracy Improved?

Three years ago, there was a meaningful accuracy gap between cloud and local speech recognition. Cloud providers could run larger models on more powerful hardware, and the difference was noticeable - especially for accented speech, noisy environments, and specialized vocabulary.

That gap has effectively closed. Several converging trends made this possible:

Better model architectures: Whisper and its derivatives showed that a single, well-trained model could match or exceed complex multi-model pipelines
Improved quantization techniques: GPTQ and similar methods made it possible to compress models dramatically without meaningful accuracy loss
Apple Silicon Neural Engine: Dedicated ML hardware in consumer devices provided enough computational power to run large models in real time
Larger and more diverse training datasets: The community now has access to training data that rivals what cloud providers had exclusively just a few years ago

In our internal benchmarks, Yaps achieves a word error rate (WER) within 0.5% of leading cloud providers on standard English dictation tasks. On specialized vocabulary that benefits from personalization (legal, medical, technical), Yaps often outperforms cloud alternatives because the personalized n-gram model provides a boost that generic cloud models cannot match.
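
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the benchmark harness used for the figures above; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Before row i is processed, previous[j] is the edit distance between
    # the first i-1 reference words and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substitution ("the" -> "a") and one deletion ("today") over six reference words.
wer = word_error_rate(
    "please send the quarterly report today",
    "please send a quarterly report",
)
```

When comparing systems, it is worth stating whether a gap such as 0.5% is measured in absolute percentage points of WER or relative to the other system's WER, since the two readings can differ substantially.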

What About Edge Cases?

Cloud models still hold advantages in specific scenarios: heavily accented speech in rare language pairs, extremely noisy environments with overlapping speakers, and real-time translation between languages. These are areas where sheer model size and training data volume still matter.

For the primary use case of dictating text on your Mac - emails, documents, notes, code comments, messages - on-device models have reached parity. The remaining gaps are narrow and narrowing further with each model generation.

11.0

What Makes Yaps Different from Other Speech-to-Text Tools?

Yaps is purpose-built for a specific use case: private, real-time dictation that can run on-device. Every architectural decision flows from that focus:

Native desktop architecture: Built with Tauri v2 and Rust, not wrapped in Electron. This means lower memory usage, faster startup, deeper system integration, and proper platform conventions.
Under 200MB memory footprint: Thanks to 4-bit quantization and efficient Core ML inference, Yaps uses a fraction of the memory of comparable tools.
Instant startup: The app launches in under a second and is ready to transcribe immediately. No loading screens, no model warm-up delays.
Offline-capable and private by design: Core dictation can run without an internet connection. Audio stays on your device for offline workflows, with no API keys or mandatory remote speech service.
Apple Silicon Neural Engine optimized: Models are compiled specifically for the Neural Engine, leaving your CPU and GPU free.
Dictation-tuned VAD: Voice Activity Detection designed for long-form dictation, not short conversational turns.
Personalized vocabulary: Local n-gram models that adapt to your specific terminology over time.
Smart history: Your dictation history is stored locally and searchable, so you can find and reuse previous transcriptions.
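
To make the personalized-vocabulary idea concrete, here is one common way a local n-gram model can influence recognition: rescoring the recognizer's candidate transcriptions with a bigram model built from the user's own history. This is a hedged sketch, not a description of how Yaps applies its n-gram model; the class, the `lm_weight` parameter, the add-one smoothing, and the acoustic scores are all assumptions made for illustration.

```python
import math
from collections import Counter

class BigramModel:
    """Add-one smoothed bigram model built from a user's past dictation."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split()
            vocab.update(words)
            self.context_counts.update(words[:-1])
            self.bigram_counts.update(zip(words, words[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence):
        """Sum of smoothed log-probabilities of each word given the previous word."""
        words = ["<s>"] + sentence.lower().split()
        total = 0.0
        for prev, word in zip(words, words[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

def rescore(hypotheses, model, lm_weight=0.5):
    """Return the text whose acoustic score plus weighted n-gram score is highest."""
    return max(hypotheses, key=lambda h: h[1] + lm_weight * model.log_prob(h[0]))[0]

# Hypothetical dictation history and made-up acoustic log-scores.
history = ["the patient has tachycardia", "tachycardia was noted on exam"]
model = BigramModel(history)
candidates = [
    ("the patient has tacky cardia", -4.0),
    ("the patient has tachycardia", -4.2),
]
best = rescore(candidates, model)
```

In this toy example the acoustically preferred candidate loses to the one containing a term the user has dictated before, which is the effect a personalized model is meant to produce for specialized vocabulary.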

For a full walkthrough of these features and how they work together, see our introduction to Yaps.

12.0

What Is the Future of Speech Recognition Technology?

Speech recognition has improved more in the last three years than in the preceding thirty. But we are still in the early innings. The trends shaping the next generation of this technology include:

Multi-speaker recognition - distinguishing between different speakers in a meeting and attributing text correctly, all processed on-device
Contextual awareness - understanding which application you are using and adapting formatting accordingly (Markdown in a code editor, formal prose in a legal document)
On-device voice cloning - using a few minutes of your speech to create a personalized text-to-speech voice that sounds like you, without uploading voice samples
Real-time local translation - speak in one language, have text appear in another, all processed by on-device models
Larger on-device models - as Apple Silicon Neural Engines grow more powerful and memory becomes cheaper, the models that can run locally will continue to grow in capability
Improved noise robustness - next-generation spectral masking and beamforming techniques that work even in the noisiest environments

The trajectory is clear: the capabilities that once required cloud infrastructure are steadily migrating to local hardware. Within a few years, there will be no accuracy-based reason to send voice data to remote servers. The only question is whether users will demand that their tools make this shift.

13.0

Frequently Asked Questions About Speech Recognition

What are the main components of a speech recognition system?

Five components: (1) audio capture and preprocessing (noise suppression, voice activity detection); (2) feature extraction (log-mel spectrograms); (3) an acoustic model — typically a transformer or conformer network — that maps audio to linguistic units; (4) a language model that resolves ambiguity using grammar and word-sequence probability; (5) a decoder and post-processor that picks the most likely transcription and adds punctuation, capitalisation, and formatting. Classical HMM systems kept these separate; modern end-to-end models like Whisper collapse stages 2–4 into a single network.

What are the basics of speech recognition technology?

Speech recognition converts acoustic vibrations into text by predicting which sequence of words most likely produced the audio. The pipeline is: digitise the sound, extract features (a frequency-vs-time spectrogram), run a trained neural model on those features, and post-process the output into readable text. The technology dates to the 1950s, but the modern transformer era arrived between 2020 and 2024. The decisive shift is that ASR models are now small enough to run entirely on a phone or laptop, removing the need to send audio to the cloud at all.

How does speech recognition work?

Speech recognition works by converting audio into text through a multi-stage pipeline. First, raw audio is captured and cleaned through noise suppression and voice activity detection. Then, an acoustic model (typically a transformer neural network) converts the audio signal into linguistic features by analyzing a spectrogram representation. A language model refines the output by resolving ambiguities and applying contextual understanding. Finally, post-processing adds punctuation, capitalization, and formatting. Modern systems like Yaps run this entire pipeline locally on your Mac using Apple Silicon's Neural Engine.

What is OpenAI Whisper and can it run on a Mac?

OpenAI Whisper is an open-source speech recognition model trained on 680,000 hours of multilingual audio data. It uses a transformer encoder-decoder architecture and achieves near-human accuracy across dozens of languages. Yes, Whisper can run locally on a Mac through projects like whisper.cpp (a C/C++ implementation) and Argmax's WhisperKit (which compiles Whisper to Core ML for Apple Silicon). Yaps uses an optimized, Whisper-derived architecture that has been fine-tuned specifically for real-time macOS dictation with 4-bit quantization for minimal memory usage.

Can speech-to-text work offline?

Yes, modern speech-to-text can work fully offline with no internet connection required. On-device speech recognition tools like Yaps run the entire transcription pipeline locally on your hardware. The neural network models, language models, and post-processing all execute on your Mac's Apple Silicon chip. There is no "degraded mode" - offline accuracy is identical to online accuracy because the same models run regardless of network connectivity. This means you can dictate on airplanes, in areas without cell service, or in secure facilities with full accuracy.

Is offline speech recognition as accurate as cloud-based services?

In 2026, yes - for standard dictation tasks, on-device speech recognition has effectively reached parity with cloud services. Advances in model architecture (particularly Whisper-derived models), quantization techniques like GPTQ, and dedicated ML hardware like Apple Silicon's Neural Engine have closed the gap. Yaps achieves a word error rate within 0.5% of leading cloud providers on standard English dictation. For specialized vocabulary, on-device tools with personalization can actually outperform cloud alternatives.

How does Apple Silicon's Neural Engine help with speech recognition?

Apple Silicon's Neural Engine is a dedicated hardware accelerator designed specifically for machine learning inference. It can perform up to 38 trillion operations per second (on the M4 chip), providing more than enough computational power to run multiple speech recognition models simultaneously in real time. When speech models are compiled to Apple's Core ML format, they run on the Neural Engine rather than the CPU or GPU, meaning dictation consumes minimal system resources and does not impact other running applications.

What is model quantization in speech recognition?

Model quantization is the process of reducing the numerical precision of a neural network's parameters to decrease memory usage and increase inference speed. Yaps uses 4-bit GPTQ quantization, which reduces each model parameter from 32-bit floating-point to 4-bit integer precision - an 8x reduction in model size. A model requiring 1.6GB at full precision fits in approximately 200MB after quantization, with less than 1% accuracy loss on standard benchmarks. This is what makes it possible to run large, accurate speech models on consumer hardware like a MacBook Air.

Why is privacy important for speech-to-text tools?

Speech data is uniquely sensitive - it contains both biometric identifiers (your unique voice characteristics) and the content of your communications (which may include confidential business information, medical records, legal correspondence, or personal thoughts). The track record of cloud providers with voice data includes Google's $68 million settlement for recording private conversations, lawsuits against companies like Fireflies.AI for harvesting biometric voice data without consent, and 276 million healthcare records breached in 2024. On-device processing eliminates these risks entirely by ensuring your audio never leaves your physical device.

How does Yaps compare to other macOS dictation tools?

Yaps is designed as a privacy-first, offline-first voice app for macOS, Windows, and Android. Unlike cloud-dependent tools like Wispr Flow (which sends audio to OpenAI/Meta servers and captures screenshots) or Granola AI (which routes transcriptions through OpenAI/Anthropic), Yaps can process core workflows locally with zero data transmission. Unlike Electron-based alternatives, Yaps is built with Tauri v2 and Rust for minimal resource usage - under 200MB of memory with instant startup on desktop. It includes features like personalized vocabulary learning, smart dictation history, voice activity detection tuned for long-form dictation, cleanup for broken dictated text, text-to-speech, synced notes, and a studio editor for refining transcriptions.

What speech recognition features is Yaps working on next?

Yaps is actively developing multi-speaker recognition (distinguishing different speakers in meetings), contextual awareness (adapting formatting based on which application you are using), on-device voice cloning (creating a personalized text-to-speech voice from a few minutes of your speech), and real-time local translation (speaking in one language and seeing text in another). All of these features will run entirely on-device, maintaining Yaps's commitment to privacy-first, offline-first architecture.

How do I get started with Yaps on my Mac?

Yaps is available on macOS, Windows, and Android. On Mac, it requires macOS 13 Ventura or later, is optimized for Apple Silicon, and also runs on Intel Macs. Installation is straightforward - download from yaps.ai, install the app for your platform, and start dictating. No cloud configuration is needed for offline-capable workflows. The desktop app launches in under a second and is ready to transcribe immediately. Hold Fn and speak to begin dictating anywhere on your Mac.