Resemble AI supports four ways to upload custom datasets to the platform.
Resemble AI Supported Datasets
In each of the scenarios below, we recommend uploading at least 20 minutes of audio for a standard North American English accent and 45 to 60 minutes for all other languages or regional English accents. We can only accept single-speaker audio. Multi-speaker datasets are not supported.
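To see how much audio a local folder holds before uploading, the durations of its WAV files can be summed. The following is a minimal sketch using Python's standard wave module. The function names and the folder layout are hypothetical, and the thresholds simply restate the 20-minute and 45-minute recommendations above.

```python
import wave
from pathlib import Path

# Minimums taken from the recommendation above, in minutes.
MIN_MINUTES_NA_ENGLISH = 20
MIN_MINUTES_OTHER = 45


def total_minutes(folder):
    """Sum the duration of every .wav file under `folder`, in minutes."""
    total_seconds = 0.0
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 60


def meets_recommendation(folder, north_american_english=True):
    """Compare the total duration with the recommended minimum."""
    minimum = MIN_MINUTES_NA_ENGLISH if north_american_english else MIN_MINUTES_OTHER
    return total_minutes(folder) >= minimum
```

For example, meets_recommendation("data/wavs", north_american_english=False) checks a folder against the 45-minute lower bound.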
Single Audio File
Upload a single audio file in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.
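The format requirements above can be checked locally before uploading. This is a minimal sketch using Python's standard wave module, not an official validator. The function name is hypothetical, and it assumes that "22 kHz" and "44 kHz" refer to the common 22050 Hz and 44100 Hz rates.

```python
import wave

# Assumption: "22 kHz" and "44 kHz" mean the common 22050 Hz and 44100 Hz rates.
ALLOWED_RATES = {8000, 16000, 22050, 44100, 48000}
ALLOWED_SAMPLE_WIDTHS = {2, 3}  # bytes per sample: 16-bit or 24-bit


def check_wav_format(path):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getsampwidth() not in ALLOWED_SAMPLE_WIDTHS:
                problems.append(f"unsupported bit depth: {wav.getsampwidth() * 8}-bit")
            if wav.getframerate() not in ALLOWED_RATES:
                problems.append(f"unsupported sampling rate: {wav.getframerate()} Hz")
    except (wave.Error, EOFError) as exc:
        # The wave module raises an error for files it cannot parse as uncompressed WAV.
        problems.append(f"not a readable PCM WAV file: {exc}")
    return problems
```

An empty list means the header matches the requirements as interpreted here. Any returned message names the property to fix before uploading.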
Single Audio File + Transcript
Upload a zip or tarball that contains a single audio file and a transcript file. The audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz. The transcript must be a .txt file.
Multiple Audio Files + Transcripts
Upload a zip or tarball that contains multiple audio files and a transcript file in CSV format. Each audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.
Each audio file should be between 1.5 and 15 seconds in duration.
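The per-file duration limit can be checked with a few lines of Python. The sketch below uses the standard wave module. The function names are hypothetical, and the limits are treated as inclusive, which is an assumption.

```python
import wave

MIN_SECONDS = 1.5
MAX_SECONDS = 15.0


def clip_duration_seconds(path):
    """Duration of one WAV file: frame count divided by sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clips_outside_range(paths):
    """Return {path: duration} for clips outside the 1.5 to 15 second range."""
    out_of_range = {}
    for path in paths:
        duration = clip_duration_seconds(path)
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            out_of_range[path] = duration
    return out_of_range
```

Clips reported here can be split or trimmed before they are added to the archive.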
Folder Structure
The folder structure you upload must contain a wavs folder and a metadata.csv file. For example:
data/
    metadata.csv
    wavs/
        wav1.wav
        wav2.wav
        wav3.wav
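A folder laid out this way can be packaged with Python's standard zipfile module. This is a sketch under stated assumptions: the function name is hypothetical, and placing metadata.csv and wavs/ at the root of the archive (rather than inside an enclosing data/ folder) is an assumption, so adjust the arcname values if the uploader expects otherwise.

```python
import zipfile
from pathlib import Path


def build_dataset_zip(data_dir, zip_path):
    """Zip metadata.csv and wavs/*.wav from data_dir; return the archived WAV names."""
    data_dir = Path(data_dir)
    metadata = data_dir / "metadata.csv"
    wav_dir = data_dir / "wavs"
    if not metadata.is_file():
        raise FileNotFoundError(f"missing {metadata}")
    if not wav_dir.is_dir():
        raise FileNotFoundError(f"missing {wav_dir}")
    wav_files = sorted(wav_dir.glob("*.wav"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumption: both entries sit at the root of the archive.
        archive.write(metadata, arcname="metadata.csv")
        for wav_file in wav_files:
            archive.write(wav_file, arcname=f"wavs/{wav_file.name}")
    return [f"wavs/{p.name}" for p in wav_files]
```

Calling build_dataset_zip("data", "dataset.zip") on the example above would produce an archive containing metadata.csv and the three files under wavs/.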
Transcript Details
The metadata.csv file must be pipe-delimited (|), with each line containing the base filename and the transcription. Note that only the base filename is used, so remove the file extension. For example:
wav1|this is what is in my file
wav2|please remove the extensions
wav3|each file should be between 1.5 to 15 seconds long
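Mismatches between metadata.csv and the wavs folder are easy to miss by eye. The following sketch cross-checks the two with Python's standard csv module. The function name is hypothetical, and reading the file as UTF-8 and skipping blank lines are assumptions rather than documented behavior.

```python
import csv
from pathlib import Path


def check_metadata(data_dir):
    """Return a list of problems in data_dir/metadata.csv (empty if none)."""
    data_dir = Path(data_dir)
    wav_names = {p.stem for p in (data_dir / "wavs").glob("*.wav")}
    problems = []
    listed = set()
    # Assumption: the file is UTF-8 encoded.
    with open(data_dir / "metadata.csv", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_number, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2 or not row[1].strip():
                problems.append(f"line {line_number}: expected 'filename|transcription'")
                continue
            name = row[0].strip()
            if name.lower().endswith(".wav"):
                problems.append(f"line {line_number}: remove the extension from {name!r}")
                name = name[:-4]
            if name not in wav_names:
                problems.append(f"line {line_number}: no matching file wavs/{name}.wav")
            listed.add(name)
    # Report audio files that have no transcript line.
    for missing in sorted(wav_names - listed):
        problems.append(f"wavs/{missing}.wav has no transcript line")
    return problems
```

Run against the example above, check_metadata("data") would return an empty list, since each of the three lines names an existing file without its extension.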
Multiple Audio Files
Upload a zip or tarball that contains multiple audio files. Each audio file must be in RIFF (.wav) PCM format, 16-bit or 24-bit, at a sampling rate of 8 kHz, 16 kHz, 22 kHz, 44 kHz, or 48 kHz.
Each audio file should be between 1.5 and 15 seconds in duration.