The Verge explains how the natural-sounding synthetic voice of Siri, Apple's personal assistant, is achieved, starting from human readers and text-to-speech software.
Ward and the company's senior design lead, David Vazquez, are part of the team working out of Nuance's Sunnyvale, CA offices creating next-generation synthetic voices. They describe their work as "part art, part science."
The text-to-speech industry is extremely competitive and highly secretive. Even though it's universally believed that Nuance created the voice of Siri, Apple's talking personal digital assistant, Ward and Vazquez coyly change the subject when asked.
That said, they've agreed to explain, at least in broad strokes, how they build voices. Needless to say, one doesn't start by recording every single word in the dictionary. But when you're talking about an application that reads any news story that comes into your RSS feed, or looks up stuff on the web for you, it needs to be able to say every word in the dictionary.
"Just say you want to know where the nearest florist is," Ward says. "Well, there are 27 million businesses in this country alone. You're not going to be able to record every single one of them."
"It's about finding short cuts," says Vazquez, a trim, bearded man who exudes a laid-back joviality. He rifles through a packet of stapled together papers that contains a script. It doesn't look like a script in the Hamlet sense of the word, but rather, an Excel-type grid containing weird sentences.
Scratching the collar of my neck, where humans once had gills.
Most of the sentences are chosen, says Vazquez, because they are "phonetically rich"; that is, they contain lots of different combinations of phonemes. Phonemes are the acoustic building blocks of language: for example, the "K" sound in "cat".
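To make "phonetically rich" concrete, here is a minimal sketch of how a script could be assembled by greedily picking the sentences that add the most new phoneme pairs. It is not Nuance's method, which the article does not describe. The tiny `LEXICON`, its ARPAbet-style symbols, and the function names `diphones` and `pick_script` are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: word -> phoneme list (ARPAbet-like, illustrative only).
LEXICON = {
    "the": ["DH", "AH"],
    "a": ["AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "tack": ["T", "AE", "K"],
}


def diphones(sentence):
    """Return the set of adjacent phoneme pairs found in a sentence."""
    phones = [p for word in sentence.lower().split() for p in LEXICON[word]]
    return set(zip(phones, phones[1:]))


def pick_script(candidates, max_sentences):
    """Greedily pick sentences that each add the most not-yet-covered pairs."""
    covered, script = set(), []
    remaining = list(candidates)
    while remaining and len(script) < max_sentences:
        best = max(remaining, key=lambda s: len(diphones(s) - covered))
        if not diphones(best) - covered:
            break  # nothing left adds new phoneme combinations
        script.append(best)
        covered |= diphones(best)
        remaining.remove(best)
    return script, covered


script, covered = pick_script(["the cat", "the cat sat", "a tack"], max_sentences=5)
print(script, len(covered))
```

In this toy run, "the cat" is never selected because "the cat sat" already contains every phoneme pair it offers. The same logic would favor odd, densely packed sentences over natural ones.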
"The sentences are sort of like tongue twisters," says Vazquez. Later, a linguist on his team objects to his use of this expression, and calls them "non sequiturs."
"The point is, the more data we have, the more lifelike it's going to be," says Ward. The sentences, while devoid of contextual meaning, are packed with data.
After the script is recorded with a live voice actor, a tedious process that can take months, the really hard work begins. Words and sentences are analyzed, catalogued, and tagged in a big database, a complicated job involving a team of dedicated linguists, as well as proprietary linguistic software.
When that's complete, Nuance's text-to-speech engine can look for just the right bits of recorded sound, and combine those with other bits of recorded sound on the fly, creating words and phrases that the actor may have never actually uttered, but that sound a lot like the actor talking, because technically it is the actor's voice.
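The idea of stitching recorded bits into words the actor never said can be sketched as follows. This is a toy version of what is generally called unit selection, not Nuance's proprietary engine. The `Unit` class, the 0/1 `join_cost`, and the phoneme lists standing in for recordings are all assumptions for illustration. A real system would store audio and use far richer acoustic costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    diphone: tuple   # adjacent phoneme pair, e.g. ("K", "AE")
    utterance: int   # which recorded sentence the unit came from
    index: int       # position of the pair within that sentence


def build_database(recordings):
    """Tag and index every phoneme pair in every recorded phoneme sequence."""
    db = {}
    for utt_id, phones in enumerate(recordings):
        for i, pair in enumerate(zip(phones, phones[1:])):
            db.setdefault(pair, []).append(Unit(pair, utt_id, i))
    return db


def join_cost(prev, cur):
    """0 if the units were adjacent in the same recording, else 1 (a splice)."""
    adjacent = prev.utterance == cur.utterance and cur.index == prev.index + 1
    return 0 if adjacent else 1


def select_units(db, target_phones):
    """Dynamic-programming search for the unit sequence with the fewest splices."""
    targets = list(zip(target_phones, target_phones[1:]))
    if not targets:
        raise ValueError("need at least two phonemes")
    # best maps each candidate final unit to (total cost, path ending in it)
    best = {u: (0, [u]) for u in db.get(targets[0], [])}
    for pair in targets[1:]:
        nxt = {}
        for cur in db.get(pair, []):
            options = [(c + join_cost(p[-1], cur), p) for c, p in best.values()]
            if options:
                cost, path = min(options, key=lambda o: o[0])
                nxt[cur] = (cost, path + [cur])
        best = nxt
    if not best:
        raise LookupError("no recorded units cover this phoneme sequence")
    return min(best.values(), key=lambda o: o[0])


# Stand-ins for the recordings "cat", "sacks", and "tack".
db = build_database([["K", "AE", "T"], ["S", "AE", "K", "S"], ["T", "AE", "K"]])
cost, path = select_units(db, ["T", "AE", "K", "S"])  # "tacks" was never recorded
print(cost, [(u.utterance, u.index) for u in path])
```

Here "tacks" is assembled with a single splice between pieces of two different recordings, while a word that was recorded whole comes back with no splices at all.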