How do I make sure my voice clone actually sounds good?

Here are some best practices we recommend to ensure a good-quality voice model.

What type of data do I need for a professional voice clone?

Resemble AI supports 4 ways to upload custom datasets to the platform:

Resemble AI Supported Datasets

In any of the scenarios below, we recommend uploading at least 20 minutes of audio data for the standard North American English accent. We can only accept single-speaker audio datasets; multi-speaker datasets are not supported.

Cloning in other languages is available to our Enterprise-level customers. Please see our pricing page for further details.

Single Audio File**

Upload a single audio file in RIFF (.wav) PCM, 16-bit or 24-bit format at 8kHz, 16kHz, 22kHz, 44kHz, or 48kHz sampling rate.

**This is our best practice recommendation when building a professional voice clone. 
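
As a rough pre-upload check, the format requirements above can be verified locally with Python's standard `wave` module. This is only a sketch and not part of the Resemble AI platform: the function name `check_wav` is hypothetical, and it assumes that "22kHz" and "44kHz" refer to the common 22050 Hz and 44100 Hz rates.

```python
import wave

# Assumption: "22kHz" and "44kHz" mean the common 22050 Hz and 44100 Hz rates.
ALLOWED_RATES = {8000, 16000, 22050, 44100, 48000}
ALLOWED_SAMPLE_WIDTHS = {2, 3}  # bytes per sample: 16-bit or 24-bit
MIN_MINUTES = 20  # recommended minimum amount of audio


def check_wav(path):
    """Return a list of problems found; an empty list means the file looks fine."""
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            width = wav.getsampwidth()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        # The wave module refuses files that are not RIFF/WAVE PCM.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if width not in ALLOWED_SAMPLE_WIDTHS:
        problems.append(f"sample width is {width * 8}-bit, expected 16-bit or 24-bit")
    if rate not in ALLOWED_RATES:
        problems.append(f"sample rate is {rate} Hz, expected one of {sorted(ALLOWED_RATES)}")
    minutes = frames / rate / 60 if rate else 0.0
    if minutes < MIN_MINUTES:
        problems.append(f"only {minutes:.1f} minutes of audio, {MIN_MINUTES}+ recommended")
    return problems
```

Calling `check_wav("my_recording.wav")` (a placeholder path) before uploading gives a quick list of anything that does not match the requirements.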


Single Audio File + Transcript

Upload a zip or tarball that contains a single audio file and a transcript file. The audio file must be in RIFF (.wav) PCM, 16-bit or 24-bit format at 8kHz, 16kHz, 22kHz, 44kHz, or 48kHz sampling rate. The transcript must be a .txt file.

Multiple Audio Files + Transcripts

Upload a zip or tarball that contains multiple audio files and a transcript file. Each audio file must be in RIFF (.wav) PCM, 16-bit or 24-bit format at 8kHz, 16kHz, 22kHz, 44kHz, or 48kHz sampling rate. The transcript must be a CSV file.

Each audio file should be between 1.5 and 15 seconds in duration.
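
A minimal sketch for finding clips outside the 1.5 to 15 second range, assuming the clips sit together in one local folder. The function names are hypothetical, and only Python's standard library is used.

```python
import wave
from pathlib import Path

MIN_SECONDS, MAX_SECONDS = 1.5, 15.0


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def out_of_range_clips(wavs_dir):
    """Map each offending file name to its duration in seconds."""
    bad = {}
    for path in sorted(Path(wavs_dir).glob("*.wav")):
        seconds = clip_seconds(path)
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            bad[path.name] = seconds
    return bad
```

An empty result from `out_of_range_clips` means every clip in the folder is within the recommended range.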

Folder Structure

The folder structure you upload must contain a "wavs" folder and a "metadata.csv" file.

For example:

data/
    metadata.csv
    wavs/
        wav1.wav
        wav2.wav
        wav3.wav

Transcript Details

The "metadata.csv" file must use "|" as its delimiter, and each line should contain the base filename and the transcription. See the following example (note that we only use the base filename and remove the extension):

wav1|this is what is in my file
wav2|please remove the extensions
wav3|each file should be between 1.5 to 15 seconds long
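
The layout and transcript rules above can be cross-checked before zipping the dataset. The sketch below is illustrative only: `check_metadata` is a hypothetical helper, and it assumes that "metadata.csv" is UTF-8 encoded and that everything after the first "|" on a line is the transcription.

```python
from pathlib import Path


def check_metadata(data_dir):
    """Return a list of problems with metadata.csv and the wavs folder."""
    data_dir = Path(data_dir)
    problems = []
    wav_names = {p.stem for p in (data_dir / "wavs").glob("*.wav")}
    listed = set()

    # Assumption: the transcript file is UTF-8 encoded.
    lines = (data_dir / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        name, sep, text = line.partition("|")
        name = name.strip()
        if not sep:
            problems.append(f"line {number}: missing '|' separator")
            continue
        if name.lower().endswith(".wav"):
            problems.append(f"line {number}: remove the extension from '{name}'")
            name = name[:-4]
        if name not in wav_names:
            problems.append(f"line {number}: no file wavs/{name}.wav")
        if not text.strip():
            problems.append(f"line {number}: empty transcription")
        listed.add(name)

    # Audio files that never appear in the transcript.
    for name in sorted(wav_names - listed):
        problems.append(f"wavs/{name}.wav has no transcript line")
    return problems
```

Running `check_metadata("data")` on the example folder above should return an empty list when every audio file has exactly the matching base filename in the transcript.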



Environment Requirements

Mic

There are many kinds of mics on the market, but for good recordings we recommend a mic that covers the frequency range 20 Hz to 20,000 Hz (i.e. 20 kHz). It is also best to choose a unidirectional mic (such mics are often sold as having a "cardioid polar pattern").

It's best to avoid mics that claim to be "omnidirectional", i.e. mics that receive sound from all directions.

If conducting interviews where the talent and the other speaker can sit opposite each other, a "figure of 8" mic might also be suitable. Each speaker talks into one side of the same mic, and the speakers can still be separated easily into different files. If you record this way, please leave us a note that this approach was used.

Position of the mic

A unidirectional mic rejects sound coming from its rear, so when recording, the speaker should ideally face the mic, and the mic should face away from any noise sources in the room (e.g. sound-leaking windows, air conditioners).

Sitting next to sound-reflective surfaces can also degrade the recording (e.g. sitting next to a window or a concrete wall). Drywall is less of a problem, but a good rule of thumb is to leave at least 2 feet of distance from any wall.

Echo

To help minimize reflections, ideally choose rooms with walls made of drywall, gypsum board, MDF (Medium Density Fiberboard), or unpolished wood. Walls draped with curtains also work well to minimize echo. It's very important to minimize stone or glass surfaces, such as glass-topped or stone-topped tables.

If this is unavoidable, a simple trick would be to try and cover any glass, stone or polished surface with a thick sweater, scarf, or blanket.

If you have a unidirectional mic at hand, it can also help a little to point the rear of the mic toward this surface.

External Noises

External noises bleed into recordings from what are known as 'flanking paths'. Anything that allows air to leave or enter the room can be a flanking path, but we only need to be mindful of the widest/loudest. Listening through a pair of headphones while pointing your unidirectional mic around the room might help you locate a flanking path and allow you to tackle it before you record.

Air conditioners can be a major problem, so it's best to set the room to the desired temperature an hour or so before recording and switch off the AC or fan while you record.

Recording Levels

Recording software typically has a level meter. It's best to turn up the preamp knob for the mic (sometimes called gain or volume) until your loudest speaking volume reaches -6 to -3 dB on this meter.
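
If your recording software has no level meter, the peak of a test recording can be measured after the fact. The sketch below is illustrative only: it assumes the meter reading corresponds to dB relative to digital full scale, it handles 16-bit PCM files only, and the function names are hypothetical.

```python
import array
import math
import sys
import wave


def peak_db(path):
    """Peak level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(peak / 32768)


def level_in_target(db, low=-6.0, high=-3.0):
    """True when the peak falls inside the recommended -6 to -3 dB window."""
    return low <= db <= high
```

Record your loudest speaking passage, run `peak_db` on the file, and adjust the gain until `level_in_target` returns `True`. For reference, a peak at half of full scale measures about -6.02 dB, just below the window.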

It is inadvisable to exceed 80% of the full gain of the preamp. Cheap recording hardware can introduce noise beyond this limit.

Interviews / Multiple Speakers

In some professional productions, a sound mixer may be present. In situations where there are multiple speakers, it is usual practice to record the main talent on the left or right channel of a stereo audio file, and any other speakers (interviewers, anchors, etc.) on the other channel. This allows a clean separation of the talent's voice when needed without complicating the format too much. A sound professional will usually be able to provide this format if requested, even after the recording day has gone by.
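
Given a stereo file recorded this way, the two channels can be separated into single-speaker mono files. The following is a minimal sketch using Python's standard `wave` module; `split_stereo` and the file paths are hypothetical, and which channel holds the talent depends on how the session was recorded.

```python
import wave


def split_stereo(src, left_path, right_path):
    """Write the left and right channels of a stereo PCM WAV as two mono files."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = wav.getsampwidth()  # bytes per sample (2 for 16-bit, 3 for 24-bit)
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    # Each frame holds one left sample followed by one right sample.
    frame = 2 * width
    left = b"".join(raw[i:i + width] for i in range(0, len(raw), frame))
    right = b"".join(raw[i + width:i + frame] for i in range(0, len(raw), frame))

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(data)
```

For example, `split_stereo("interview.wav", "talent.wav", "interviewer.wav")` would keep the left channel as the talent's file, assuming the talent was recorded on the left.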

When there are more than 2 speakers, it is highly advisable to have a location recordist mic each speaker and record in a poly-wav format. This allows any/all of the recorded voices to have their own separate audio file, with minimal bleed of sound.

Post Processing

Audio that goes through a sound house will very likely involve some kind of sweetening process that prepares the voice for release on different media (radio, TV, podcast, YouTube). Such processing can make two sets of data for the same speaker sound slightly different.

If the sound house has retained a copy of the unprocessed original audio, this is our best case scenario. One can always process the cloned voice audio to suit the requirement.

For good voice clones, we currently prefer audio that isn't processed by any of the following:

  1. Analogue Emulation software or Exciters:
    • These are typically used to add warmth, presence, and character to voices.
  2. Compressors and Equalizers:
    • Compressors squash the dynamic range, making it harder to isolate the voice from any background noise.
    • Equalizers can change the audio to sound clearer, bassier, huskier, or crisper. This can be a powerful tool to match different recordings or different mics, but can also introduce unwanted variation between recordings.