What are best practices for Text-to-Speech?

Tips & tricks for getting the output you need.

This guide introduces specific techniques for directing your Resemble voice models to produce audio output that sounds more like natural speech.
For instance, when the text contains individual letters such as A, B, C, or acronyms such as NASA or ASAP, consider adding slight pauses.
When the text contains numbers such as 1, 2, 3, use punctuation to improve pacing.

Techniques for Natural Speech Output

  1. Punctuation for Pauses:

    • Commas (,): Use commas to create brief pauses within sentences, mimicking natural breathing and speech patterns.
      Example: "I went to the pet store, bought some goldfish, and returned home."
    • Periods (.): Use periods to signal the end of a sentence, creating a longer pause and a clear break between thoughts.
      Example: "I went to the pet store. I bought some goldfish. I returned home."
  2. Filler Words for Natural Flow:

    • Common Fillers: Include words like "uh," "um," "well," and "you know" to make speech sound more conversational and less robotic.
      Example: "Um, let me check that for you, uh, one moment please."
  3. Adjusting Rhythm and Pacing:

    • Variable Sentence Lengths: Mix short and long sentences to create a more engaging and dynamic speech pattern.
      Example: "I love making music. It's creative, yet demanding."
    • Intentional Pauses: Use ellipses (...) or dashes (—) to create intentional pauses for dramatic effect or to emphasize certain points.
      Example: "The outcome was...surprising."
  4. Emphasis on Key Words:

    • Capitalization: Capitalize words that need emphasis in the text to prompt Resemble.ai to highlight them in speech.
      Example: "This is REALLY incredible."

Pronunciation control

While we do not offer pronunciation control as part of our API, you can respell a word the way it is spoken and supply that spelling through your LLM prompt or a text normalization step.

  • For example, Siobhan can be written as Shauvaughn.
    Example: "Can I confirm that your name, spelled Ess Eye Owe Beee Aitch Eigh En, is pronounced as Shauvaughn?"
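
As a rough illustration of the text normalization approach, the sketch below swaps known words for their spoken-form respellings before the text is sent for synthesis. The `RESPELLINGS` table and the `apply_respellings` helper are hypothetical names made up for this example and are not part of the Resemble API.

```python
import re

# Hypothetical respelling table. Only the "Siobhan" entry comes from this
# guide; add your own entries as you find words that need help.
RESPELLINGS = {
    "Siobhan": "Shauvaughn",
}


def apply_respellings(text, respellings=RESPELLINGS):
    """Replace whole-word matches with their spoken-form respelling."""
    if not respellings:
        return text
    # \b keeps the match to whole words, so longer words that merely
    # contain an entry are left untouched.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in respellings) + r")\b"
    )
    return pattern.sub(lambda match: respellings[match.group(1)], text)


print(apply_respellings("Hello Siobhan, how can I help?"))
```

Running the normalization as a separate step keeps the original spelling available for display while the respelled version is used only for audio.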

Emotion control

To make your voice clone sound emotional, write your text in a storytelling style like a book. Look for words and phrases in books that express the feelings you want.

For example, use tags like "she said, sadly" or "they shouted fearfully" to help the voice model understand the emotion you're aiming for. This helps create customized voice recordings for different uses.

For instance:

  • Example: "Are you sure about that?" he asked, sounding unsure.
  • Example: "Don’t scare me!" she said fearfully.

Remember not to include these tags in the final script for the AI to read aloud.
Although the voice clone can sometimes figure out emotions from the context of the text, it doesn't always get it right.

Alphabet

While we are currently working on improving alphabet pronunciation (A-Z), if you encounter issues with pronunciation in your use case, we suggest using the following spelled-out words as your text input. You can include this as part of your prompt in your LLM, or use text normalization.

  • Add punctuation between groups of two to four letters to create natural pauses.
    Example: "To confirm, is your referral code Queue Why. Eigh Beee?"

Suggested Text Input: "The alphabets are Eigh, Beee, Sea, Deee, Eeeee, Eff, Geee, Aitch, Eye, Jay, Kay, Elle, Emm, En, Owe, Peee, Queue, Ar, Ess, Teee, Yue, Veee, Double Yue, Eks, Why, Zeee."

Alphabet Pronunciation Guide

Letter  Phonetic Word
A       Eigh
B       Beee
C       Sea
D       Deee
E       Eeeee
F       Eff
G       Geee
H       Aitch
I       Eye
J       Jay
K       Kay
L       Elle
M       Emm
N       En
O       Owe
P       Peee
Q       Queue
R       Ar
S       Ess
T       Teee
U       Yoo or Yue
V       Veee
W       Double Yoo or Double Yue
X       Eks
Y       Why
Z       Zeee

  • You can use these phonetic words to improve alphabet pronunciation in your text inputs.
    Example: "To confirm, is your referral code Queue Why. Eigh Beee?"
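
If you apply the pronunciation guide in a text normalization step, it could look something like the sketch below. The `spell_out` helper and its `group_size` parameter are hypothetical names chosen for this example. The mapping uses the "Yue" variants from the suggested text input, and a period is placed between groups to create the pause, as in the referral code example.

```python
# Phonetic words taken from the pronunciation guide above.
PHONETIC = {
    "A": "Eigh", "B": "Beee", "C": "Sea", "D": "Deee", "E": "Eeeee",
    "F": "Eff", "G": "Geee", "H": "Aitch", "I": "Eye", "J": "Jay",
    "K": "Kay", "L": "Elle", "M": "Emm", "N": "En", "O": "Owe",
    "P": "Peee", "Q": "Queue", "R": "Ar", "S": "Ess", "T": "Teee",
    "U": "Yue", "V": "Veee", "W": "Double Yue", "X": "Eks", "Y": "Why",
    "Z": "Zeee",
}


def spell_out(code, group_size=2):
    """Spell a letter code phonetically, with a period between groups.

    Characters that are not letters A-Z are skipped in this sketch.
    """
    letters = [PHONETIC[ch] for ch in code.upper() if ch in PHONETIC]
    groups = [
        " ".join(letters[i:i + group_size])
        for i in range(0, len(letters), group_size)
    ]
    return ". ".join(groups)


print(f"To confirm, is your referral code {spell_out('QYAB')}?")
```

A `group_size` between 2 and 4 follows the grouping suggested in this section; which value sounds best for a given voice is something to check by listening.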

Acronyms

In most cases, acronyms can be handled by just providing the letters of the acronym. Because some acronyms are pronounced as a word (e.g., NASA) while others are spelled out (e.g., MLB), our model will attempt to pronounce the acronym correctly in your audio output.
Separating the letters with punctuation can also improve acronym pronunciation.

  • Example: "I love watching MLB Baseball."
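
One possible way to separate the letters automatically is sketched below. It assumes you maintain your own list of acronyms that should be spelled out, and that a period after each letter is an acceptable separator. Both are assumptions for illustration, and `SPELLED_ACRONYMS` and `separate_acronyms` are hypothetical names.

```python
import re

# Hypothetical list of acronyms that should be read letter by letter.
# Acronyms spoken as words (such as NASA) are deliberately left out.
SPELLED_ACRONYMS = {"MLB"}


def separate_acronyms(text, acronyms=SPELLED_ACRONYMS):
    """Insert a period after each letter of the listed acronyms."""

    def expand(match):
        word = match.group(0)
        if word in acronyms:
            return ". ".join(word) + "."
        return word

    # Only runs of two or more capital letters are considered.
    return re.sub(r"\b[A-Z]{2,}\b", expand, text)


print(separate_acronyms("I love watching MLB Baseball."))
```

Acronyms not on the list pass through unchanged, so the model still decides how to read them.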

Numbers

Depending on how you want numbers to be spoken in your audio output, consider using the following prompts for number pronunciation.

To have a number read as "fifteen hundred and fifteen", write that phrase out in words, including the word "and". Otherwise, the model will pronounce 1515 as "one thousand five hundred and fifteen".

  • Example: "The total is 1515, or fifteen hundred and fifteen"
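
If you generate scripts programmatically, the wording can be produced in a normalization step. The sketch below is a minimal example that covers only four-digit numbers with a natural "hundreds" reading. The `in_hundreds` helper is a hypothetical name and not part of any Resemble tooling.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
]


def two_digits(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def in_hundreds(n):
    """Spell 1100-9999 as a count of hundreds, e.g. 1515."""
    hundreds, rest = divmod(n, 100)
    # Numbers such as 2000-2099 have no natural "hundreds" reading.
    if not 1100 <= n <= 9999 or hundreds % 10 == 0:
        raise ValueError(f"{n} has no 'hundreds' reading in this sketch")
    phrase = f"{two_digits(hundreds)} hundred"
    return phrase if rest == 0 else f"{phrase} and {two_digits(rest)}"


print(f"The total is {in_hundreds(1515)}.")
```

Numbers outside the supported range raise a `ValueError`, so the caller can fall back to passing the digits through unchanged.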

Words in different languages

If you are having trouble with words rooted in different languages in your text, you can try spelling the word out phonetically and your voice model will attempt to pronounce the word correctly in your audio output. However, in some cases your voice model can pronounce these words correctly without the phonetic spelling. 

  • Example: "I want to rendezvous with you."
  • Example: "I want to rahndayvoo with you."