Remember HAL, the computer that famously chatters away in a humanlike voice in Stanley Kubrick’s movie 2001: A Space Odyssey? The one that, near the end of the story, breaks into a dejected rendition of the song Daisy Bell as an astronaut takes it apart? Today, speech synthesis with text to speech software is no longer science fiction. One of the earliest examples of articulatory speech synthesis dates back to 1769, when the Austro-Hungarian inventor Wolfgang von Kempelen began work on one of the world’s first mechanical speaking machines, which generated crude, humanlike sounds using bellows and bagpipe components. Years of experiments and development have refined the technology so much that text to speech software has found its place in day to day applications. Many of you will have tried convenient text to speech software such as that provided by TTS-Soft.

You must have wondered how text to speech software turns every written word into one you can actually hear. The complete process can be broadly simplified into three stages: it first converts text to words, then words into phonemes, and finally phonemes into sound. Wondering what phonemes are? For now, think of them as the sound components from which any spoken word can be constructed. Read on to understand the process.

1. Pre-Processing/Normalization/Text to Word with Text to Speech Software
  • The same written word can have multiple meanings, which creates ambiguity, so it’s important to understand the meaning in order to read it correctly. Preprocessing is about narrowing down the many different ways one could read a piece of text into the one that’s most appropriate.
  • To follow the sense of what’s written and arrive at the most likely pronunciation, computers use statistical probability techniques or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns). This stage also covers numbers, special characters, currency symbols, dates, times, abbreviations, and acronyms, all of which have to be expanded into words.
  • Some words are pronounced in different ways according to what they mean. These are called homographs: "read", for example, sounds different in the present and the past tense. To handle them, text to speech software has to work out from the surrounding text, for instance by recognizing verbs and their tense, which reading is meant.
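
The expansion of abbreviations, currency amounts and numbers described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the abbreviation table, the function names `normalize` and `number_to_words`, and the restriction to integers below 1000 are all made up for the example, and real text to speech software relies on far larger rule sets or trained models.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table; a real system would carry a much larger
# one and use context to choose between competing expansions.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n):
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def dollars_to_words(match):
    """Read "$5" as "five dollars": the symbol moves after the number."""
    amount = int(match.group(1))
    return number_to_words(amount) + (" dollar" if amount == 1 else " dollars")


def normalize(text):
    """Rewrite abbreviations, dollar amounts and small integers as words."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    text = re.sub(r"\$(\d{1,3})\b", dollars_to_words, text)
    # Any remaining standalone integer of up to three digits.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    return text


print(normalize("Dr. Smith paid $42 for 3 books."))
# Doctor Smith paid forty-two dollars for three books.
```

Longer numbers, dates and times are deliberately left untouched here; each of them needs its own reading rules.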


2. Synthetic Analysis + Phonetization + Prosody

  • To reproduce the natural sound of a language, text to speech software draws on recordings of a series of texts that together contain every possible sound in the chosen language. These recordings are fragmented and structured into an acoustic database of recorded speech segments: syllables, diphones, words, morphemes, phrases, and sentences.
  • Next, the text to speech software carries out a sophisticated linguistic analysis to transpose the written text into phonetic text.
  • To give a sentence rhythm and intonation, TTS works out its prosody through grammatical and syntactic analysis. This lets the system define the way each word needs to be pronounced so as to reconstruct the sense.
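
As a rough illustration of the step from written text to phonetic text, the sketch below looks words up in a toy pronunciation lexicon, picks a pronunciation for the homograph "read" from the words before it, and then splits the phoneme sequence into diphones (pairs of neighbouring sounds). Everything in it is an assumption made for the example: the ARPAbet-style symbols, the tiny lexicon, the cue-word rule and the names `phonetize` and `to_diphones`. Real systems use large dictionaries plus statistical or neural models.

```python
# Toy pronunciation lexicon with ARPAbet-style symbols (illustrative only).
LEXICON = {
    "i": ["AY"],
    "the": ["DH", "AH"],
    "book": ["B", "UH", "K"],
    "will": ["W", "IH", "L"],
    "have": ["HH", "AE", "V"],
}

# A homograph: one spelling, two pronunciations, chosen by tense.
HOMOGRAPHS = {"read": {"present": ["R", "IY", "D"], "past": ["R", "EH", "D"]}}

# Hypothetical cue words suggesting that a later "read" is in the past tense.
PAST_CUES = {"have", "has", "had", "yesterday"}


def phonetize(sentence):
    """Turn a sentence of known words into a flat list of phonemes."""
    words = sentence.lower().split()
    phonemes = []
    for position, word in enumerate(words):
        if word in HOMOGRAPHS:
            seen_cue = PAST_CUES & set(words[:position])
            tense = "past" if seen_cue else "present"
            phonemes += HOMOGRAPHS[word][tense]
        else:
            # "?" stands in for the letter-to-sound rules a real system
            # would apply to words missing from its dictionary.
            phonemes += LEXICON.get(word, ["?"])
    return phonemes


def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding with silence ("_")."""
    padded = ["_"] + phonemes + ["_"]
    return [a + "-" + b for a, b in zip(padded, padded[1:])]


print(phonetize("I will read the book"))
print(phonetize("I have read the book"))
print(to_diphones(["R", "EH", "D"]))
# ['_-R', 'R-EH', 'EH-D', 'D-_']
```

The diphone names produced at the end are the kind of keys under which recorded snippets could be stored in an acoustic database.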


3. Text to Speech Software and Unit Selection/Phonemes to Sound

  • There are three different approaches to converting phonemes to sound: concatenative, formant and articulatory synthesis.
  • Concatenative: The computer rearranges little snippets of recorded human speech in a practically unlimited number of combinations to create entirely new words and sentences. It is the most natural sounding approach, but it is limited to the voice that was recorded.
  • Formant: It combines 3–5 key frequencies (formants) that the human vocal tract generates to make the sounds of speech. It can create any speech sound from scratch and change the voice to male, female or child.
  • Articulatory: It is the most complex of the three approaches. It models the human vocal apparatus itself, using mechanical, electrical, or electronic components, with the aim of producing the most realistic and humanlike voice.
  • Finally, the system works out the tone and the required duration of each sound from the phonetic transcription, which ends the analysis part. The TTS (text to speech) software then selects the best units from the acoustic database to generate the desired sound.
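
To make the formant approach more concrete, here is a minimal sketch that builds a vowel-like sound from scratch: a train of pulses (standing in for the vocal cords) is passed through three simple resonating filters, one per formant. The sample rate, the pitch, the function names and the formant frequencies and bandwidths are assumptions chosen for illustration; the values are only approximate figures for an "ah"-like vowel, and production synthesizers are considerably more elaborate.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def resonator(signal, frequency, bandwidth):
    """Second-order digital resonator that boosts energy near `frequency` (Hz)."""
    period = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth * period)
    b = (2.0 * math.exp(-math.pi * bandwidth * period)
         * math.cos(2.0 * math.pi * frequency * period))
    a = 1.0 - b - c  # gives the filter unity gain at 0 Hz
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(formants, pitch=120, duration_ms=300):
    """Filter a pulse train through one resonator per (frequency, bandwidth)."""
    n = SAMPLE_RATE * duration_ms // 1000
    samples_per_pulse = SAMPLE_RATE // pitch
    signal = [1.0 if i % samples_per_pulse == 0 else 0.0 for i in range(n)]
    for frequency, bandwidth in formants:
        signal = resonator(signal, frequency, bandwidth)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # scale into the range -1.0 .. 1.0


# Approximate formant frequencies and bandwidths (Hz) for an "ah"-like vowel.
AH = [(730, 90), (1090, 110), (2440, 170)]
samples = synthesize_vowel(AH)
print(len(samples))  # 4800 samples, i.e. 0.3 seconds of audio
```

Raising `pitch` or shifting the formant frequencies changes how the voice sounds, which is why this approach can switch between male, female and child voices without any new recordings.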

Over the last decade or so, neural networks have been applied to speech synthesis. They are quite promising but still need to be explored further. Most text to speech software can interpret text and output voice in an intelligent manner, but it has yet to develop the ability to handle the wide spectrum of human intonation. Modern text to speech software implements quite complicated and sophisticated methods and algorithms, and we have tried to summarize how it all works in simple language. No matter what text to speech software you are using, we hope that you now understand the basic process behind it and no longer have to wonder how it works.