An e-learning company converted 60 hours of course text to audio in 2019 using a commercial TTS service. The service cost $0.016 per character and produced a robotic monotone with no natural pauses; 73% of learner survey respondents said the "audio was distracting." In 2024 the company ran the same 60 hours through a neural TTS system. The cost was $0.000030 per character (about 533× cheaper), and 68% of survey respondents said the audio was "as natural as a human narrator." The underlying technology changed completely in five years.
Neural TTS (used in this tool) differs from concatenative TTS in one key way: instead of stitching together recorded phoneme samples, it generates a mel-spectrogram from text using a transformer model, then converts that spectrogram to an audio waveform using a vocoder. This produces prosody (the rise and fall of pitch) that matches the meaning of the whole sentence rather than treating each word in isolation.
Format Reference: Which Output to Choose
| Format | Size (1 min speech) | Best for |
|---|---|---|
| MP3 128 kbps | ~960 KB | Web playback, podcast, mobile |
| MP3 64 kbps | ~480 KB | Bandwidth-constrained playback |
| WAV 16-bit 22 kHz (mono) | ~2.5 MB | Further audio editing |
| OGG Vorbis | ~700 KB | Open-source projects, web |
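The MP3 and WAV figures in the table follow from simple arithmetic, sketched below. The function names are illustrative and not part of this tool. The sketch assumes constant-bitrate MP3 and mono PCM at 22,050 Hz, and it ignores container headers. OGG Vorbis is usually variable-bitrate, so its size depends on the content and the quality setting and is not computed here.

```python
def cbr_size_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Constant-bitrate size: bits per second times duration, divided by 8."""
    return int(bitrate_kbps * 1000 * seconds / 8)


def pcm_size_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> int:
    """Uncompressed PCM size, ignoring the small WAV header."""
    return int(sample_rate_hz * (bit_depth // 8) * channels * seconds)


if __name__ == "__main__":
    print(cbr_size_bytes(128, 60))           # 960000 bytes, about 960 KB
    print(cbr_size_bytes(64, 60))            # 480000 bytes, about 480 KB
    print(pcm_size_bytes(22050, 16, 1, 60))  # 2646000 bytes, about 2.5 MiB
```

To estimate a longer recording, multiply by its length in minutes: a 10-minute lesson at 128 kbps comes to roughly 9.6 MB.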
Where Neural TTS Still Struggles
- Proper nouns and acronyms: "SQL" is pronounced "sequel" by many developers but "S-Q-L" in some contexts. Neural TTS picks one reading and cannot infer which is correct for your audience. Use phonetic spelling in your input text if you need a specific pronunciation.
- Numbers and units: "3.5" might be read as "three point five" or "three and a half". "1,000" might be read as "one thousand" or "one comma zero zero zero" depending on locale settings.
- Emotional range: Neural TTS can produce warm, neutral, or energetic delivery, but it cannot convincingly produce grief, sarcasm, or controlled anger. For emotionally demanding narration, a human voice actor still performs better.
- Languages with tonal systems: Mandarin Chinese, Thai, and Vietnamese require correct tones for meaning. Neural TTS quality varies significantly by language; check with a native speaker before publishing.
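The acronym and number problems above can often be handled before the text reaches the TTS engine. The following is a minimal sketch of that idea, not something built into this tool: the `PRONUNCIATIONS` table, the function names, and the regular expressions are all assumptions for illustration, and the number handling covers only simple English-style numerals.

```python
import re

# Hypothetical override table: map each term to the spelling you want spoken.
PRONUNCIATIONS = {
    "SQL": "sequel",
    "GUI": "gooey",
}


def apply_pronunciations(text: str, overrides: dict = PRONUNCIATIONS) -> str:
    """Replace whole-word occurrences of each term with its phonetic spelling."""
    for term, spoken in overrides.items():
        text = re.sub(rf"\b{re.escape(term)}\b", spoken, text)
    return text


def normalize_numbers(text: str) -> str:
    """Remove thousands separators and spell out the decimal point."""
    # "1,000" -> "1000": drop a comma that sits between a digit and a 3-digit group.
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # "3.14" -> "3 point 1 4": fractional digits are spaced so they are read one by one.
    return re.sub(
        r"(\d+)\.(\d+)",
        lambda m: f"{m.group(1)} point {' '.join(m.group(2))}",
        text,
    )


def prepare_for_tts(text: str) -> str:
    return normalize_numbers(apply_pronunciations(text))


if __name__ == "__main__":
    print(prepare_for_tts("Export 1,000 rows from SQL in 3.5 seconds."))
    # Export 1000 rows from sequel in 3 point 5 seconds.
```

The word-boundary match leaves names such as "MySQL" untouched. Version strings like "1.2.3" and non-English number formats are not handled and would need their own rules.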
Practical Input Tips
Write your text the way you want it spoken. Use full stops to create pauses. Spell out abbreviations. Break long sentences into two shorter ones, because neural TTS handles 15-word sentences better than 40-word ones. Avoid em-dashes inside sentences (the model pauses inconsistently at them); use commas or split into separate sentences instead.
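Two of these tips are easy to check automatically. The sketch below is one possible approach and uses hypothetical helper names: it replaces em-dashes with commas and lists sentences longer than a word limit so you can split them by hand. The default limit of 15 words is an assumption taken from the guideline above, and the sentence splitter is deliberately naive (it breaks on ".", "!" and "?" followed by whitespace).

```python
import re

MAX_WORDS = 15  # assumed threshold; adjust for your voice and content


def replace_em_dashes(text: str) -> str:
    """Turn an em-dash (U+2014) and any surrounding spaces into a comma and a space."""
    return re.sub(r"\s*\u2014\s*", ", ", text)


def long_sentences(text: str, max_words: int = MAX_WORDS) -> list:
    """Return the sentences that contain more than max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


if __name__ == "__main__":
    draft = "Keep it short \u2014 the model pauses here. This one is fine."
    cleaned = replace_em_dashes(draft)
    print(cleaned)
    for sentence in long_sentences(cleaned):
        print("Consider splitting:", sentence)
```

The function only flags long sentences and does not split them, since choosing where to break a sentence is an editorial decision.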
