MOSS-TTS-Nano: 0.1B Parameter Model Brings Multilingual Speech Synthesis to CPU

MOSI.AI and the OpenMOSS team released MOSS-TTS-Nano on April 10, 2026, a text-to-speech model with only 100 million parameters that runs in real time on CPU hardware without GPU acceleration. The model supports 20 languages, including Chinese, English, and multiple European and Asian languages, and generates 48 kHz stereo audio.

Pure Autoregressive Architecture Eliminates Traditional Vocoding

MOSS-TTS-Nano uses a pure autoregressive Audio Tokenizer plus LLM pipeline instead of traditional neural vocoding approaches. The MOSS-Audio-Tokenizer-Nano component compresses audio into a 12.5 Hz token stream using residual vector quantization (RVQ) with 16 codebooks, supporting variable bitrates from 0.125 to 2 kbps. This unified discrete audio interface maintains compatibility across the entire MOSS-TTS model family.
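The stated bitrate range can be checked with simple arithmetic. The sketch below assumes each RVQ codebook holds 1024 entries (10 bits per code); that size is not stated in the announcement and is inferred only because it reproduces the quoted 0.125 and 2 kbps endpoints at 12.5 frames per second. The function names are illustrative and not part of any MOSS-TTS API.

```python
import math

FRAME_RATE_HZ = 12.5   # token frames per second (from the article)
MAX_CODEBOOKS = 16     # RVQ depth (from the article)
CODEBOOK_SIZE = 1024   # ASSUMPTION: inferred from 0.125 kbps with one codebook


def rvq_bitrate_kbps(num_codebooks, frame_rate_hz=FRAME_RATE_HZ,
                     codebook_size=CODEBOOK_SIZE):
    """Bitrate of an RVQ token stream that uses the first `num_codebooks` layers."""
    if not 1 <= num_codebooks <= MAX_CODEBOOKS:
        raise ValueError(f"num_codebooks must be in 1..{MAX_CODEBOOKS}")
    bits_per_code = math.log2(codebook_size)
    # One code per codebook per frame, so bits/s = layers * bits * frames/s.
    return num_codebooks * bits_per_code * frame_rate_hz / 1000.0


def tokens_for_duration(seconds, num_codebooks):
    """Number of discrete codes needed to represent `seconds` of audio."""
    frames = math.ceil(seconds * FRAME_RATE_HZ)
    return frames * num_codebooks


if __name__ == "__main__":
    for layers in (1, 4, 8, 16):
        print(f"{layers:2d} codebooks: {rvq_bitrate_kbps(layers):.3f} kbps")
```

Under this assumption, dropping residual layers trades audio fidelity for a proportionally lower bitrate, which is one plausible reading of how the variable-bitrate range arises.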

CPU-Only Deployment Targets Edge and Budget-Constrained Applications

The model's 0.1B parameter count enables deployment scenarios where GPU access is unavailable or cost-prohibitive. MOSS-TTS-Nano supports multiple deployment interfaces including Python scripts, FastAPI web applications, and command-line tools. Real-time streaming inference with low latency makes it suitable for local demonstrations, web serving, and lightweight product integration.
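A common way to verify a "real-time on CPU" claim on your own hardware is to measure the real-time factor (RTF): synthesis time divided by the duration of the audio produced. The sketch below is generic and assumes a hypothetical `synthesize(text)` callable that returns one item per audio frame; it does not reflect the actual MOSS-TTS-Nano interface.

```python
import time


def real_time_factor(synthesize, text, sample_rate=48_000):
    """Return synthesis_time / audio_duration; values below 1.0 are faster than real time.

    `synthesize` is a HYPOTHETICAL callable returning a sequence with one
    entry per audio frame (for stereo, one entry per left/right pair).
    """
    start = time.perf_counter()
    frames = synthesize(text)
    elapsed = time.perf_counter() - start

    if len(frames) == 0:
        raise ValueError("synthesizer returned no audio")

    audio_seconds = len(frames) / sample_rate
    return elapsed / audio_seconds


if __name__ == "__main__":
    # Stand-in synthesizer that "produces" one second of silence.
    def fake_synthesize(text):
        return [(0.0, 0.0)] * 48_000

    print(f"RTF: {real_time_factor(fake_synthesize, 'hello'):.4f}")
```

For streaming use, time to first audio chunk matters as much as overall RTF, so a fuller benchmark would record both.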

Voice Cloning Through Reference Audio Samples

MOSS-TTS-Nano includes voice cloning capabilities through reference audio samples, with automatic chunked processing for long-form text synthesis. Users can provide sample audio to clone specific voice characteristics without additional training. The model's architecture maintains voice consistency across extended generation tasks through its chunking mechanism.
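The announcement does not describe how the chunking works internally. As an illustration of the general idea, the sketch below splits long text at sentence boundaries and packs sentences into chunks under a character budget; the `max_chars` limit and the `chunk_text` name are assumptions for this example, not the model's actual mechanism.

```python
import re

# Split after Latin sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for i, chunk in enumerate(chunk_text("One. Two. Three.", max_chars=10)):
        print(i, repr(chunk))
```

In a cloning pipeline, each chunk would typically be synthesized with the same reference audio so the voice stays consistent across chunk boundaries.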

Key Takeaways

MOSS-TTS-Nano contains only 100 million parameters and runs real-time speech synthesis on CPU without GPU requirements
The model supports 20 languages and outputs 48 kHz stereo audio using a pure autoregressive pipeline
Voice cloning functionality works through reference audio samples with automatic chunking for long-form text
Multiple deployment options include Python APIs, FastAPI web apps, and CLI tools for different integration scenarios
The GitHub repository reached 208 stars within days of the April 10, 2026 release
