"Words mean more than what is set down on paper. It takes the human voice to infuse them with deeper meaning."
A recent study* examined how audio quality affects the perceived quality of the human voice. The researchers knew from the start that the results could be highly subjective, but they approached the question with measurable methods. While tallying the results, they were surprised by one finding they hadn't set out to measure. But it's something we in the advertising and production business already knew - more on that later.
In the study, a mix of professional engineers and laypeople listened to a variety of male and female voices. In completely random order, the researchers played voices processed with and without reverb, compression, equalization, and so on. They made sure there was no A-B comparison of processed and unprocessed samples, which would have skewed the results.
The researchers played back the recordings in different ways: making the voice sound as if it were in the same room as the listener, adding short reverb to create a false environment, varying the volume, and degrading the sound to telephone quality.
Just like the recent Kentucky Derby with its tight finish, the results were close. Listeners heard very little difference in quality between voices that were and weren't heavily processed. One interesting bump in the results was a preference for voices recorded in completely dry environments (no room bounce or noise).
What I gleaned from the study is that the audio quality of the playback matters less than the person doing the speaking. Huh, imagine that. It does matter who you pick to read the words. This reminds me of what I learned while studying music in college - most instruments that carry the melody should try to sound like the human voice, because the human voice is the original musical instrument.
We can be fooled into thinking an engineered or synthesized sound is real if we aren't intimately familiar with it. We synthesize strings, keyboards, percussion, and other instruments. Some theaters still use sheets of metal in the rafters to simulate thunder. Sound designers use all kinds of tricks to create sounds in movies like stabbing a watermelon to fake a stabbing, patting bags of corn starch for walking in the snow, and breaking a piece of celery instead of a real femur.
But faking the human voice is the holy grail. We are so familiar with the nuances of not only our own voice, but of any human, that it's hard to pull one over on us. Software engineers have come close, but are still miles away. As artificial intelligence improves, so will voice synthesis. But one can only go so far with mimicking inflections, pauses, and breaths. You can't engineer emotion.
Even though Douglas Rain voiced the HAL 9000 computer in 2001: A Space Odyssey with a monotone delivery, you could still hear emotion. If they had somehow synthesized his voice (not very easy to do in 1968), HAL might not have seemed so sinister (or innocent).
So what was that surprising finding in the voice study? Listeners clearly preferred male voices over female voices. Without opening that can of worms labeled "Gender Equality," we know that advertisers and producers have long favored the male voice to deliver their message. That raises the question: why do we prefer the sound of a male voice? Is it because of its tonal quality, or because we are used to male voices delivering messages to us in media?
* "On the Assessment of High-Quality Voice Recordings including Voice Postprocessing"
JOHN G. BEERENDS, AES Fellow (TNO, P. O. Box 96800, NL-2509 JE, The Hague, The Netherlands)
and IMRE BEERENDS (Mantis Audio, Wateringen, The Netherlands)
Did You Know?
- The computer's voice in 2001: A Space Odyssey was originally supposed to be female (Athena), but was changed to match the computer's new name, HAL, short for Heuristically programmed ALgorithmic computer.
- The first success at speech synthesis happened in 1779 when Christian Kratzenstein of the Russian Academy of Sciences built models of the human vocal tract that produced five vowel sounds (a,e,i,o,u).
- In the 1930's Bell Laboratories developed the vocoder, which analyzed basic speech sounds (fundamental tones and resonances).
- From the vocoder came the voder, the first human voice synthesizer. It was shown at the 1939 World's Fair by Bell Labs' Homer Dudley.
- Bell Labs once again led the pack with the first successful computer-generated voice synthesizer in 1961. Their IBM 704 sang "Daisy Bell."
- Author Arthur C. Clarke was visiting Bell Labs in 1961 and heard the "Daisy Bell" demonstration. He had the HAL 9000 computer sing it in 2001: A Space Odyssey.
- Texas Instruments' Speak & Spell from 1978 is an early example of a handheld speech synthesizer.
- The first video game with speech synthesis was Stratovox from Sun Electronics in 1980.
- One study** found that people listening to a voice recording can tell whether or not a person is smiling.
- British scientist Stephen Hawking has used a speech synthesizer since the mid-1980's. The original device offered only an American accent. Hawking's current speech synthesizer retains that same voice, because he now identifies with it.
- Vocal cords are also known as "vocal folds" and "vocal reeds."
- Vocal folds vibrate when air passes over them (during an exhale). Longer folds vibrate at lower frequencies (male voices) than shorter folds (female voices).
- Harmonics are produced when the vocal folds collide with each other.
- Female singers, from contralto to soprano, cover the range from F3 (the F below middle C on a piano) to C6.
- Male singers, from bass to tenor, cover the range from E2 (almost 2 octaves below middle C) to C5.
- The highest note demanded in the classical repertoire is G6 (in a Mozart aria and a Massenet opera), about two and a half octaves above middle C.
- The lowest note demanded in the classical repertoire is D2 (in a Mozart aria), almost two octaves below middle C.
- Tim Storms holds the Guinness World Record for the largest vocal range at ten octaves.
** (Amy Drahota and colleagues, University of Portsmouth, UK)
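The note names in the vocal-range facts above correspond to specific frequencies. As a quick sketch (assuming standard equal temperament with A4 tuned to 440 Hz and the usual MIDI note numbering, where middle C is note 60), the frequencies mentioned can be computed:

```python
# Equal-temperament pitch: f(n) = 440 * 2^((n - 69) / 12),
# where n is the MIDI note number and A4 (n = 69) is tuned to 440 Hz.
def note_freq(midi_note: int) -> float:
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)

# MIDI numbers for the notes mentioned above (C4 = middle C = 60).
notes = {
    "D2 (lowest demanded, classical)": 38,
    "E2 (bottom of male range)":       40,
    "F3 (bottom of female range)":     53,
    "C4 (middle C)":                   60,
    "C5 (top of male range)":          72,
    "C6 (top of female range)":        84,
    "G6 (highest demanded, classical)": 91,
}

for name, n in notes.items():
    print(f"{name}: {note_freq(n):.2f} Hz")
```

Each octave doubles the frequency, so a ten-octave range like Tim Storms' spans a frequency ratio of about 2^10, roughly a thousandfold.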