Voice generation using text: A deep-learning method

By Aditya Abeysinghe

Generating speech that resembles a human voice from text is the main function of a text-to-speech (TTS) system. The process of converting text to speech is known as speech synthesis. Recorded speech is used to generate new speech based on the text input to the TTS system. Since the 1960s, several TTS systems have been developed for speech synthesis. However, these systems have several issues, which has led to the use of deep-learning methods to synthesize speech.

Current methods

Two main methods exist for speech synthesis in traditional systems: concatenative and parametric. In concatenative synthesis, recorded speech waveforms are joined to produce a speech stream. This method uses a waveform database to store and retrieve recorded speech. The speech segment that matches each piece of input text is selected and appended to the stream to produce the final speech. In parametric synthesis, digital signal processing methods generate the speech. Different parametric approaches use parameters, such as phonetic and noise parameters, that vary with time to create a waveform. Other techniques use deep neural networks or hidden Markov models to produce waveforms.
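
The concatenative approach can be illustrated with a minimal sketch. It assumes a tiny, hypothetical waveform database in which short sine tones stand in for recorded speech units; the unit labels, the sample rate and the names WAVEFORM_DB and concatenate_units are invented for illustration and do not come from any particular TTS system.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for illustration)


def tone(freq_hz, n_samples=400):
    """Stand-in for a recorded speech unit: a short sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n_samples)]


# Hypothetical waveform database: unit label -> stored samples.
WAVEFORM_DB = {
    "HH": tone(200),
    "AH": tone(300),
    "L": tone(250),
    "OW": tone(350),
}


def concatenate_units(units, db=WAVEFORM_DB):
    """Select the stored waveform for each unit and join them into one stream."""
    stream = []
    for unit in units:
        stream.extend(db[unit])  # retrieve the unit and append it to the stream
    return stream


speech = concatenate_units(["HH", "AH", "L", "OW"])
print(len(speech))  # four units of 400 samples each
```

A real system would store many recordings of each unit and choose the one that joins most smoothly with its neighbours; this sketch keeps a single recording per unit to show only the select-and-join step.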

The process

The first stage of speech synthesis is to analyze the text input to the TTS system. This involves tokenizing the text and removing blank characters. Sentences are broken into tokens, which are passed to the next stage, linguistic analysis. In this stage, phonemes, syllables and words are analyzed in a text-to-phoneme conversion. The parameter prediction module then predicts acoustic feature parameters from the output of the linguistic analysis. Finally, the speech synthesis module produces the speech waveform.
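
The four stages can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the lexicon, the acoustic table (a pitch and a duration per phoneme) and all function names are hypothetical, and a lookup table stands in for the parameter prediction module.

```python
import math

# Hypothetical pronunciation lexicon: word -> phonemes.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical acoustic parameters: phoneme -> (pitch in Hz, duration in samples).
ACOUSTIC_TABLE = {
    "HH": (180, 240), "AH": (220, 400), "L": (200, 320), "OW": (210, 480),
    "W": (190, 320), "ER": (205, 400), "D": (185, 240),
}


def analyze_text(text):
    """Stage 1: split into tokens, dropping blanks and punctuation."""
    tokens = (tok.strip(".,!?").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def linguistic_analysis(tokens, lexicon=LEXICON):
    """Stage 2: text-to-phoneme conversion (unknown words are skipped here)."""
    phonemes = []
    for token in tokens:
        phonemes.extend(lexicon.get(token, []))
    return phonemes


def predict_parameters(phonemes, table=ACOUSTIC_TABLE):
    """Stage 3: look up acoustic feature parameters for each phoneme."""
    return [table[p] for p in phonemes]


def synthesize(parameters, sample_rate=8000):
    """Stage 4: turn the parameters into a waveform (one sine tone per phoneme)."""
    waveform = []
    for pitch_hz, n_samples in parameters:
        waveform.extend(math.sin(2 * math.pi * pitch_hz * i / sample_rate)
                        for i in range(n_samples))
    return waveform


tokens = analyze_text("Hello,  world!")
phonemes = linguistic_analysis(tokens)
waveform = synthesize(predict_parameters(phonemes))
print(tokens, phonemes, len(waveform))
```

Each function consumes only the output of the previous one, which mirrors how the stages hand data to one another in the description above.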

Issues in synthesis methods

The main issue with non-deep-learning methods is that they are inefficient at processing complex text input that involves decision logic. In addition, these methods divide the input into separate regions and use separate parameters for each region. This fragments the data input to the TTS system, which results in poorly fitted models. Therefore, the accuracy of such models when producing speech on test data is low.

Deep-learning methods in synthesis

Deep-learning methods use artificial neural networks for speech production. The models created by other methods are replaced with these neural networks in the part of the process between linguistic analysis and parameter generation, as described in the process section. In a deep neural network, linguistic features are processed by the hidden layers of the network. The network is trained by adjusting its internal parameters, or weights, so that the error at each step is minimized. Because a single network handles all of the input, the second issue described above, the division of input data, is reduced. Complex logic can also be represented by a deep-learning model, which addresses the first issue.
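
The training idea can be shown with a very small network written in plain Python. This is a sketch under loud assumptions, not a production TTS model: the two linguistic features, the target values and every name in it are invented, and a single hidden layer stands in for the several layers of a real deep network.

```python
import math
import random

# Invented toy data: [is_vowel, is_stressed] -> normalised pitch target.
DATA = [
    ([0.0, 0.0], 0.1),
    ([0.0, 1.0], 0.3),
    ([1.0, 0.0], 0.5),
    ([1.0, 1.0], 0.9),
]


def init_params(n_in=2, n_hidden=4, seed=0):
    """Create small random weights for one hidden layer and one output unit."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Pass linguistic features through the hidden layer to one acoustic value."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(params["w1"], params["b1"])]
    output = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return hidden, output


def mean_squared_error(params, data=DATA):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train(params, data=DATA, epochs=2000, lr=0.1):
    """Adjust the weights by gradient descent to reduce the prediction error."""
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward(params, x)
            err = output - y  # gradient of 0.5 * err**2 w.r.t. the output
            for j, h in enumerate(hidden):
                # Gradient at hidden unit j, computed before w2[j] is updated.
                grad_hidden = err * params["w2"][j] * (1.0 - h * h)
                params["w2"][j] -= lr * err * h
                params["b1"][j] -= lr * grad_hidden
                for i, xi in enumerate(x):
                    params["w1"][j][i] -= lr * grad_hidden * xi
            params["b2"] -= lr * err


params = init_params()
loss_before = mean_squared_error(params)
train(params)
loss_after = mean_squared_error(params)
print(loss_before, loss_after)
```

One set of weights serves every input here, rather than a separate set of parameters per region of the input, which is the property the paragraph above credits with reducing fragmentation.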

Image Courtesy: https://www.voicebase.com/