What type of data do I need to provide to clone a voice?

Resemble AI supports four ways to upload custom datasets to the platform.

In any of the scenarios below, we recommend uploading at least 20 minutes of audio for the standard North American English accent and 45-60 minutes for all other languages or regional English accents. We can only accept single-speaker audio; multi-speaker datasets are not supported.

Single Audio File

Upload a single audio file in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.
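Before uploading, it can help to confirm locally that a file matches these requirements. The following is a minimal sketch using Python's standard `wave` module, not an official Resemble AI tool. The function name `check_wav_format` is hypothetical, and the sketch assumes that "22 kHz" and "44 kHz" refer to the common 22050 Hz and 44100 Hz rates; adjust the set if your files use other exact values.

```python
import wave

# Bytes per sample: 2 bytes = 16-bit, 3 bytes = 24-bit.
ALLOWED_SAMPLE_WIDTHS = {2, 3}

# Assumption: "22 kHz" and "44 kHz" mean the common 22050 Hz and 44100 Hz rates.
ALLOWED_SAMPLE_RATES = {8000, 16000, 22050, 44100, 48000}


def check_wav_format(path):
    """Return a list of problems found; an empty list means the file looks fine."""
    try:
        # The wave module only opens RIFF/WAVE files containing PCM audio,
        # so a failure here means the container or encoding is wrong.
        with wave.open(str(path), "rb") as wav:
            width = wav.getsampwidth()
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if width not in ALLOWED_SAMPLE_WIDTHS:
        problems.append(f"bit depth is {width * 8}-bit; expected 16-bit or 24-bit")
    if rate not in ALLOWED_SAMPLE_RATES:
        problems.append(f"sampling rate is {rate} Hz; not in the supported list")
    return problems
```

Calling `check_wav_format("my_voice.wav")` (a placeholder filename) returns an empty list when the file passes both checks.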

Single Audio File + Transcript

Upload a zip or tarball that contains a single audio file and a transcript file. The audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz. The transcript must be a .txt file.

Multiple Audio Files + Transcripts

Upload a zip or tarball that contains multiple audio files and a transcript file in CSV format. Each audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.

Each audio file should be between 1.5 and 15 seconds in duration.
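One way to find clips outside the 1.5 to 15 second range before packaging them is sketched below with Python's standard `wave` module. The helper names `clip_seconds` and `out_of_range_clips` are hypothetical, and the sketch assumes both limits are inclusive.

```python
import wave

# Duration limits from the guidance above (assumed inclusive).
MIN_SECONDS = 1.5
MAX_SECONDS = 15.0


def clip_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def out_of_range_clips(paths):
    """Return {path: seconds} for every clip shorter or longer than the limits."""
    flagged = {}
    for path in paths:
        seconds = clip_seconds(path)
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            flagged[str(path)] = seconds
    return flagged
```

Clips that are too long can usually be split at pauses; clips that are too short are typically better dropped than padded.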

Folder Structure

The folder structure you upload must contain a wavs folder and a metadata.csv file. For example:

data/
    metadata.csv
    wavs/
        wav1.wav
        wav2.wav
        wav3.wav

Transcript Details

The metadata.csv file must use | as the delimiter, and each line should contain the base filename followed by the transcription. Note that only the base filename is used, so remove the extension. See the following example:

wav1|this is what is in my file
wav2|please remove the extensions
wav3|each file should be between 1.5 to 15 seconds long
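The folder layout and transcript format described above can be cross-checked locally before creating the archive. The following is a rough sketch, not an official validator: the function name `check_dataset` is hypothetical, and it assumes metadata.csv is UTF-8 encoded and that every file in wavs/ should have a transcript line.

```python
from pathlib import Path


def check_dataset(root):
    """Cross-check metadata.csv against the wavs/ folder under `root`.

    Returns a list of problems; an empty list means nothing was flagged.
    """
    root = Path(root)
    problems = []
    listed = set()

    # Assumption: the transcript file is UTF-8 encoded.
    lines = (root / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first | only, so a transcription may itself contain |.
        name, sep, text = line.partition("|")
        if not sep or not name.strip() or not text.strip():
            problems.append(f"line {number}: expected 'filename|transcription'")
            continue
        name = name.strip()
        if name.lower().endswith(".wav"):
            problems.append(f"line {number}: remove the .wav extension from '{name}'")
            name = name[:-4]
        listed.add(name)
        if not (root / "wavs" / f"{name}.wav").is_file():
            problems.append(f"line {number}: wavs/{name}.wav not found")

    # Assumption: every audio file should have a matching transcript line.
    for wav in sorted((root / "wavs").glob("*.wav")):
        if wav.stem not in listed:
            problems.append(f"wavs/{wav.name} has no transcript line")
    return problems
```

Running `check_dataset("data")` against the example layout returns an empty list when each line in metadata.csv matches a file in wavs/.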

Multiple Audio Files

Upload a zip or tarball that contains multiple audio files. Each audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.

Each audio file should be between 1.5 and 15 seconds in duration.