I was 11 clips into a faceless history video when I realized I'd spent 47 minutes recording, re-recording, and trimming my own voiceover — all for 3 minutes of usable audio. That's when I actually committed to CapCut's AI voice tools instead of treating them as a backup. This guide covers how to set up CapCut text to speech on every platform, what voice categories exist, what controls you get, how voice cloning works, and a specific bug you will hit if you ever try to trim a voice clip after applying pitch.

CapCut AI Voice Text to Speech Setup on Mobile

Most tutorials tell you to find the TTS button in the Audio panel. That's wrong for mobile — it isn't there.

On iOS and Android, CapCut text to speech is tied to text layers, not audio clips. Here's the actual path:

  1. Open your project and tap Text in the bottom toolbar.
  2. Tap Add text and type your script.
  3. Tap the text layer on the timeline to select it.
  4. Scroll right along the bottom toolbar until you see the speaker icon labeled Text to Speech.
  5. Tap it. A voice selection panel opens. Preview a voice by tapping its name, then tap Apply, or tap Apply to all to use it on multiple clips.

CapCut generates the audio and places it as a separate clip below your text layer on the timeline. You can mute or delete the text layer and work with the audio track on its own.

One thing most tutorials skip: Apply to all applies the same voice to every text layer in the project using identical settings. Useful for chapter-by-chapter narration. Less useful if different sections need different tones — you'll need to apply voices individually in that case.

CapCut AI Voice Generator Setup on Desktop and Web

Desktop and web handle this differently from mobile, and slightly differently from each other.

Desktop (Windows and macOS):

  1. Open your project. Go to the Text panel on the left and drag Add Text onto the timeline.
  2. Type your script in the text box on the right-side panel.
  3. With the text layer selected, find Text to Speech in the right panel below the font controls.
  4. Click it, choose a voice, and click Start reading.

Web (capcut.com):

The web editor includes a standalone CapCut AI voice generator tool outside the main editor. Go to capcut.com/tools/text-to-speech, paste your script into the text box, pick a voice from the right panel, and click Generate. Download the output as MP3, or as MP3 plus SRT caption file. This workflow is faster for scripted voiceovers when you're not doing complex editing in the same session.
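
If you download the MP3 plus SRT option, a quick sanity check on the caption timing can save a re-export. The sketch below is a minimal example, assuming the SRT file uses the standard SubRip timestamp layout (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). The function name `srt_cue_durations` is hypothetical and is not part of any CapCut tooling.

```python
import re

# Standard SubRip timing line, e.g. "00:00:02,500 --> 00:00:06,000".
# Assumption: the downloaded SRT follows this common layout.
_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def srt_cue_durations(srt_text: str) -> list[float]:
    """Return the length in seconds of every caption cue found in srt_text."""
    durations = []
    for match in _TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        durations.append(round(end - start, 3))
    return durations


if __name__ == "__main__":
    sample = (
        "1\n00:00:00,000 --> 00:00:02,500\nRome was not built in a day.\n\n"
        "2\n00:00:02,500 --> 00:00:06,000\nIt took centuries.\n"
    )
    for index, seconds in enumerate(srt_cue_durations(sample), start=1):
        print(f"cue {index}: {seconds}s")
```

A cue that runs far longer than the others usually points to a sentence that would read better split in two before you regenerate.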

Inside the full web editor, look for the AI Voice section in the dashboard, then select Text to Speech to work within a project.

The CapCut custom AI voice tool is a separate web page that lets you build a persistent voice profile across projects — more on that in the cloning section below.

All CapCut TTS Voice Categories and Names

CapCut advertises 200+ AI voices, but the exact catalog changes by region, language, app version, and account type. There's no fixed public list that stays stable — new voices get added, some get renamed, and availability varies. The voice selection panel inside the editor is always the most accurate view of what you actually have access to.

What does stay consistent is the category structure. Voices split into four groups.

Narrator voices are the workhorses for faceless content, tutorials, and explainer videos. Named options include Male Storyteller, Female Storyteller, Professor, and Serious Female. Professor runs at a noticeably measured cadence — useful for educational content where deliberate pacing matters. Serious Female is the one you hear on a lot of finance and true crime TikToks.

Character voices are CapCut's answer to the TikTok character voice trend. Available names include Jessie, Bestie, Trickster, Kawaii Vocalist, Anime Girl, Kiddo, and Witty. Most skew younger and suit Shorts and Reels better than long-form content. Witty is the closest to a natural conversational tone without tipping into obvious character territory.

Effect voices apply transformations on top of the speech synthesis: Robot, Chipmunk, Elf, Santa, Ghost, and Zombie. These aren't about realistic narration — they fit comedic or horror-adjacent content. Robot is the one creators actually use regularly, particularly for tech tutorials or gaming content where a synthetic sound works with the aesthetic rather than against it.

Multilingual voices cover 20+ languages with multiple styles per language. English, Spanish, French, Portuguese, and Japanese have the widest selection. The voice count per language drops off quickly for less common options. Filter by language inside the voice selection panel before browsing.

CapCut AI Voice Speed, Pitch, and Customization Controls

After generating a voice clip, you can adjust speaking speed from 0.5x to 2x inside the voice settings. 1.2x is a common working choice for narration that sounds slightly up-tempo without feeling rushed. Pitch has a slider but no displayed numerical value — you're working by ear, not by semitones.
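
Because the speed setting is a single multiplier for the whole clip, you can estimate the finished length before generating anything. This is a rough sketch, not CapCut's own math: the 150 words-per-minute baseline is an assumption (actual pace varies by voice), and `adjusted_duration` is a hypothetical helper name.

```python
def adjusted_duration(word_count: int, speed: float = 1.0,
                      base_wpm: float = 150.0) -> float:
    """Estimate clip length in seconds at a given speed multiplier.

    base_wpm is an assumed 1x speaking rate, not a value published by CapCut.
    """
    if word_count < 0:
        raise ValueError("word_count cannot be negative")
    if not 0.5 <= speed <= 2.0:
        # The article gives 0.5x to 2x as the available range.
        raise ValueError("speed must be between 0.5 and 2.0")
    base_seconds = word_count / base_wpm * 60.0
    return base_seconds / speed


if __name__ == "__main__":
    words = 450
    for speed in (1.0, 1.2, 2.0):
        print(f"{speed}x -> {adjusted_duration(words, speed):.0f} seconds")
```

Under those assumptions a 450-word script runs about 3 minutes at 1x and about 2.5 minutes at 1.2x, which is handy when you're cutting visuals to a target length.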

On the web tool, you can set the volume and preview the voice before you commit. The 5-second preview button in the panel matches the final output, which saves you from generating a full clip only to discover the voice isn't right.

Punctuation affects pacing in a way most tutorials don't mention. A comma adds a short pause, a period adds a full stop, and a question mark adds upward inflection on most voices. If a voice is reading too flat, adding a comma where you'd naturally pause in speech tends to loosen it up.
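
One way to act on this before pasting a script in: scan it for long sentences that contain no commas, since those are the likeliest to come out flat. The sketch below is a simple heuristic, not anything built into CapCut. The 18-word threshold and the `flag_flat_sentences` name are assumptions chosen for illustration.

```python
import re


def flag_flat_sentences(script: str, max_words: int = 18) -> list[str]:
    """Return sentences longer than max_words that contain no comma.

    These are candidates for an added comma (a short pause) or a split.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    flagged = []
    for sentence in sentences:
        if not sentence:
            continue
        if "," not in sentence and len(sentence.split()) > max_words:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    script = (
        "Rome fell slowly. "
        "The empire expanded across three continents over four centuries "
        "while its rulers fought constant wars against rivals on every "
        "single border."
    )
    for sentence in flag_flat_sentences(script):
        print("Consider adding a pause:", sentence)
```

Treat the output as a list of places to read aloud and mark your natural pauses, not as sentences that are wrong.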

There's no word-level speed control. You get one speed setting for the entire clip. That's the main practical gap between CapCut's built-in TTS and dedicated voice tools.

The CapCut AI Voice Pitch Reset Bug — and How to Fix It

This one is documented in CapCut's own community forums and it catches people regularly. If you apply a pitch or speed change to an AI voice clip and then trim or split it, the effect resets. The clip still looks like the adjustment is applied, but on export the audio reverts to its original pitch.

It affects both desktop and mobile, though it shows up more often on desktop.

The fix is to bake the effect into the audio file before you cut it:

  1. Apply your pitch and speed settings to the full, uncut clip.
  2. Mute all other tracks and export just that audio.
  3. Reimport the exported file into your project.
  4. Trim and split freely. The effect is now part of the file rather than a layer that can decouple from the clip.

If you only need a tonal shift rather than a precise pitch value, use Voice Effects instead of the pitch slider. Go to Audio > Voice Effects and pick something like Deep, Chipmunk, or Robot. These render differently from the pitch slider and don't reset on trim. Not a replacement for precise pitch control, but it skips the export-reimport step for simple adjustments.

CapCut Voice Cloning Setup and What It Actually Costs

Voice cloning in CapCut sits behind a Pro subscription. CapCut's pricing varies by region, device, and current promotions — iOS App Store prices differ from Android and web checkout prices, and regional rates vary further. Check the Upgrade screen inside your own account for the current figure before planning a workflow around it.

The cloning process itself: record or upload a clean audio sample of your voice. The 2026 update reduced the required length — roughly 10 seconds of clear audio is enough to generate a usable clone. CapCut analyzes your pitch, cadence, and tone, then lets you type any script and produce audio in your voice. The clone works across languages, which helps multilingual channels maintain a consistent sound without re-recording in every language.

Some premium AI voice features in CapCut consume credits. Credit allocation varies across Pro plan types — monthly, annual, Pro+, and Teams plans receive different monthly amounts. Per-feature credit costs for voice cloning aren't fully disclosed on the public pricing page. Check your Credits panel before running long scripts to avoid surprises mid-project.

The cloned voice appears under My Voices inside the TTS panel — same access path as any other voice, just in your personal library rather than the public catalog.

For creators running faceless channels where the host voice is part of the brand, this is the feature that separates a consistent identity from content that sounds like every other account using the same public narrator preset.

CapCut Voice Changer vs CapCut Text to Speech

These two tools solve different problems and they're often confused with each other.

CapCut text to speech takes written text and generates a voice from scratch. No recording required. You type, it speaks.

CapCut voice changer takes existing recorded audio and applies a transformation to it. You record yourself or import audio, then reshape it. Access it via Audio > Voice Changer in the editing panel. Options include Voice Filters (tonal adjustments), Voice Characters (stylized effects), and Speech to Song (converts speech to a melodic format).

Voice changer tends to produce a more natural result when the source recording is clean, because it works from a real voice rather than synthesizing one. TTS is faster and requires no microphone setup. Most faceless creators use TTS; creators who record themselves but want to alter the result use the changer.

When CapCut AI Voice Isn't Enough

CapCut TTS works well for short-form content where voice is texture — quick Reels, TikTok story formats, YouTube Shorts narration. The voices are good enough for a 60-second clip where most viewers won't register the AI origin.

For longer content, the limitations become audible. No prosody control, no word-level emphasis, and the emotional range on most voices is narrow. A 10-minute explainer using the Professor voice will sound flat by the 4-minute mark. In those situations, a dedicated voice tool like ElevenLabs gives you more expressive output with per-word speed and emphasis controls — at the cost of an extra step to export audio and import it into CapCut separately.

If you're primarily making short-form content and already editing in CapCut, the built-in TTS is the practical choice. If voice quality is a meaningful part of the content itself — explainer series, narrated documentaries, YouTube channels where listeners are evaluating your credibility — a dedicated tool is worth the additional workflow step.

CapCut AI Voice FAQ

Is CapCut text to speech free?

Yes. The standard TTS voices are available on the free plan with no watermark on exported audio. Voice cloning is the feature that requires a paid subscription.

How many voices does CapCut have?

CapCut's official TTS page lists 200+ AI voices. The actual number you see depends on your region, app version, and account type. The catalog gets updated without notice — some voices get added, some renamed. Use the preview panel inside the editor rather than relying on any external list as a reference.

Can I use CapCut AI voice for commercial content?

CapCut's TTS page states that generated audio can be used for commercial projects, but the actual rights depend on the specific voice or material, your account type, region, and CapCut's current license terms. For client work, ads, or brand campaigns, verify that the specific voice you're using is marked for commercial use rather than assuming all voices carry the same rights by default.

Does CapCut AI voice work offline?

No. TTS generation requires an internet connection. The feature runs on CapCut's servers, not locally on your device.

How do I stop the pitch effect from resetting when I trim a voice clip?

Apply pitch and speed to the full uncut clip first. Mute all other tracks, export that audio, then reimport the file before trimming. Once the effect is baked into the file rather than applied as a layer, trimming won't reset it.

Which CapCut AI voice works best for TikTok?

Serious Female and Jessie are the two that show up most in trending TikTok narration formats. Serious Female works for straight narration-heavy content; Jessie fits conversational or comedic formats. Both are free. They're recognizable enough that regular TikTok viewers associate them with the format rather than noticing the AI origin — which cuts both ways depending on whether you want your content to sound genre-specific or distinct.

What's the difference between CapCut voice changer and text to speech?

Text to speech generates audio from typed text — no recording needed. Voice changer transforms existing recorded audio. If you record your own voice and want to alter its tone or style, use the voice changer. If you want narration without recording at all, use text to speech.
