Here are some best practices that we recommend to ensure a good-quality voice model.
In general, the recorded data should match the scenario and domain in which the voice will be used. For example, conversational dialogue is ideal for a voice that should sound conversational (interviews, phone conversations, etc.), whereas a voice meant for reading fictional audiobooks should be built from recordings of passages from fictional works such as Harry Potter.
We need the audio to be recorded in WAV, AIFF, or FLAC format. Ideally, provide the data at 24-bit resolution, or at 16-bit at the very least.
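As a rough pre-upload check, the format and bit-depth guidance above could be sketched as follows. This is a minimal illustration rather than an official validator: the function name `check_recording` is hypothetical, the `.aif` extension is an assumed alias for AIFF, and only WAV headers are inspected because the Python standard library reads them directly.

```python
import wave
from pathlib import Path

# Extensions taken from the guidance above; ".aif" is an assumed alias for AIFF.
ACCEPTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".flac"}


def check_recording(path):
    """Return a list of human-readable problems found for one audio file."""
    path = Path(path)
    problems = []
    suffix = path.suffix.lower()
    if suffix not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix}")
        return problems
    if suffix == ".wav":
        # The stdlib wave module reads PCM WAV headers; it raises
        # wave.Error for files it cannot parse.
        with wave.open(str(path), "rb") as wav:
            bit_depth = wav.getsampwidth() * 8
        if bit_depth < 16:
            problems.append(f"bit depth is {bit_depth}-bit; at least 16-bit is needed")
    return problems
```

An empty list means the file passed these two checks only; AIFF and FLAC files would need a third-party reader to inspect their bit depth.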
Minimum Audio Duration
The dataset should contain a minimum of 30 minutes of audio for an ideal voice. In some scenarios, less data will suffice.
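To see whether a set of recordings reaches the 30-minute guideline, the durations can be summed before uploading. The sketch below assumes the recordings are PCM WAV files; the helper names are hypothetical and not part of any product API.

```python
import wave

MIN_TOTAL_SECONDS = 30 * 60  # the 30-minute guideline above


def wav_duration_seconds(path):
    """Duration of one WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(paths):
    """Combined duration of all files in the dataset."""
    return sum(wav_duration_seconds(p) for p in paths)


def meets_minimum(paths, minimum=MIN_TOTAL_SECONDS):
    """True when the dataset is at least `minimum` seconds long."""
    return total_duration_seconds(paths) >= minimum
```

For example, `meets_minimum(["take1.wav", "take2.wav"])` would report whether those two (hypothetical) files together reach 30 minutes.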
Valid Formats for Uploading Data
There are 4 valid formats for uploading your own dataset. You can learn more here:
There are different kinds of mics on the market, but for good recordings, we would ideally choose a mic that covers the frequency range from 20 Hz to 20,000 Hz (i.e. 20 kHz). Additionally, it would be best to choose a mic that is unidirectional (such mics are often sold as having a "Cardioid polar pattern").
It's best to avoid mics that claim to be "omnidirectional", i.e. receiving sound from all directions.
If you are conducting an interview in which the talent and the other speaker can sit opposite each other, a "figure of 8" mic might also be suitable. Each speaker talks into one side of the same mic, and the speakers can still be separated easily into different files. If you record this way, please leave us a note that this approach was used.
Position of the mic
A unidirectional mic rejects sound coming from its rear, so when recording, the speaker should ideally face the mic, and the mic should face away from any noise sources in the room (e.g. a window that leaks sound, or an air conditioner).
Sitting next to a sound-reflective surface can also degrade the recording (e.g. sitting next to a window or a concrete wall). Drywall is less of a problem, but a good rule of thumb is to leave at least 2 feet of distance from any wall.
To help minimize the effect of reflection, ideally choose a room with walls made of drywall, gypsum board, MDF (Medium Density Fiberboard), or unpolished wood. Walls draped with curtains also work well to minimize echo. It's very important to minimize stone or glass surfaces, such as glass-topped or stone-topped tables.
If this is unavoidable, a simple trick would be to try and cover any glass, stone or polished surface with a thick sweater, scarf, or blanket.
If you have a unidirectional mic at hand, it can also help a little to point the rear of the mic toward this surface.
External noises bleed into recordings from what are known as 'flanking paths'. Anything that allows air to leave or enter the room can be a flanking path, but we only need to be mindful of the widest/loudest. Listening through a pair of headphones while pointing your unidirectional mic around the room might help you locate a flanking path and allow you to tackle it before you record.
Air conditioners can be a major problem, so it's best to set the room to the desired temperature an hour or so before recording and switch off the AC or fan while you record.
Recording software typically has a level meter. It's best to turn up the preamp knob for the mic (sometimes called gain or volume) until your loudest speaking volume reaches -6 to -3 dB on this meter.
It is inadvisable to exceed 80% of the full gain of the preamp. Cheap recording hardware can introduce noise beyond this limit.
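If you want to double-check a test take after recording, the peak level can be computed from the file itself. The sketch below assumes that the meter reads in dBFS (decibels relative to full scale, which is common in recording software but not stated above) and that the take is a 16-bit PCM WAV file; the function names are hypothetical.

```python
import array
import math
import sys
import wave


def peak_dbfs_16bit(path):
    """Peak level of a 16-bit PCM WAV file in dB relative to full scale."""
    with wave.open(str(path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence has no finite level
    return 20 * math.log10(peak / 32768)


def peak_in_target_range(path, low=-6.0, high=-3.0):
    """True when the loudest sample falls in the -6 to -3 dB window."""
    return low <= peak_dbfs_16bit(path) <= high
```

A take whose loudest passage peaks above the window suggests lowering the gain; one that peaks well below it suggests raising the gain, within the preamp limit mentioned above.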
Interviews / Multiple Speakers
In some professional productions, a sound mixer may be present. When multiple speakers are present, it is common practice to record the main talent on the left or right channel of a stereo audio file, and any irrelevant speakers (interviewers, anchors, etc.) on the other channel. This allows a clean separation of the talent's voice when needed without complicating the format too much. A sound professional will usually be able to provide this format if requested, even after the recording day has passed.
When there are more than 2 speakers, it is highly advisable to have a location recordist mic each speaker and record in a poly-wav format. This allows any/all of the recorded voices to have their own separate audio file, with minimal bleed of sound.
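For illustration, separating a stereo or multi-channel (poly) WAV file into one mono file per channel could look like the sketch below. It assumes uncompressed, interleaved PCM data that the standard-library `wave` module can read; the function name and the `_chN` output naming are hypothetical choices, and the output directory is assumed to exist.

```python
import wave
from pathlib import Path


def split_channels(path, out_dir):
    """Write each channel of an interleaved PCM WAV to its own mono file."""
    path, out_dir = Path(path), Path(out_dir)
    with wave.open(str(path), "rb") as src:
        n_channels = src.getnchannels()
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    frame_size = n_channels * width
    out_paths = []
    for ch in range(n_channels):
        # Take this channel's bytes out of every interleaved frame.
        start = ch * width
        mono = b"".join(
            frames[i + start : i + start + width]
            for i in range(0, len(frames), frame_size)
        )
        out_path = out_dir / f"{path.stem}_ch{ch + 1}.wav"
        with wave.open(str(out_path), "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(mono)
        out_paths.append(out_path)
    return out_paths
```

With a stereo interview file, this would yield one file for the talent and one for the other speakers, depending on which channel each was recorded to.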
Audio that goes through a sound house will very likely undergo some kind of sweetening process that prepares the voice for release on different media (radio, TV, podcasts, YouTube). Such processing can make two sets of data from the same speaker sound slightly different.
If the sound house has retained a copy of the unprocessed original audio, this is our best case scenario. One can always process the cloned voice audio to suit the requirement.
For good voice clones, we currently prefer audio that isn't processed by any of the following:
- Analogue emulation software or exciters:
  - These are typically used to add warmth, presence and character to voices.
- Compressors and equalizers:
  - Compressors squash the dynamic range, making it harder to isolate the voice from any background noise.
  - Equalizers can change the audio to sound clearer, bassier, huskier, or crisper. This can be a powerful tool to match different recordings or different mics, but it can also introduce unwanted variation between recordings.
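To make the point about compressors concrete, here is a small worked calculation using a simplified static compressor curve. The threshold, ratio, and input levels are hypothetical values chosen for illustration; real compressors also have attack and release behaviour and makeup gain, which this sketch ignores.

```python
def compress_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static gain curve of a simple downward compressor (levels in dB)."""
    if level_db <= threshold_db:
        return level_db  # below the threshold the signal passes unchanged
    # Above the threshold, every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (level_db - threshold_db) / ratio


voice_db, noise_db = -6.0, -50.0  # hypothetical voice peak and background noise
gap_before = voice_db - noise_db  # 44.0 dB between voice and noise
gap_after = compress_db(voice_db) - compress_db(noise_db)  # 33.5 dB
```

In this example the voice peak is pulled down from -6 dB to -16.5 dB while the noise stays at -50 dB, so the gap between them shrinks by 10.5 dB. Any makeup gain applied afterwards raises voice and noise together and does not restore the gap.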